Abstract


We establish and explore a new connection between two general on-line scenarios deriving from two
historically disjoint communities. Though the problems are inherently similar, the techniques and questions
developed for these two scenarios are very different. From competitive analysis comes the problem of
metrical task systems, where the algorithm is to decide in which state to process each of several sequential tasks,
where  each  task  specifies  the  processing  cost  in  each  state,  and  the  algorithm  must  pay  according  to  a  metric 
to  move  between  states.  And  from  machine  learning  comes  the  problem  of  predicting  from  expert  advice  — 
that  is,  of  choosing  one  of  several  experts  for  each  query  in  a  sequence  without  doing  much  worse  than  the 
best  expert  overall. 

The  dissertation  includes  four  results  touching  on  this  connection.  We  begin  with  the  first  metrical  task 
system  algorithm  that  can  guarantee  for  every  task  sequence  that  the  ratio  of  its  expected  cost  to  the  cheapest 
way  to  process  the  sequence  is  only  polylogarithmic  in  the  number  of  states.  Then  we  see  how  we  can  use 
expert-advice  results  to  combine  on-line  algorithms  on-line  if  there  is  a  fixed  cost  for  changing  between  the 
on-line  algorithms.  The  third  result  establishes  new  expert-advice  algorithms  deriving  from  metrical  task 
system  research;  in  addition  to  establishing  theoretical  bounds,  we  compare  the  algorithms  empirically  on  a 
process  migration  scenario.  Finally,  we  investigate  a  modified  version  of  paging,  where  we  want  to  do  well 
against  an  adversary  who  is  allowed  to  ignore  a  paging  request  cheaply. 
Chapter 1
Introduction 


1.1  Summary 

Beginning  in  the  mid-1980s,  researchers  of  theoretical  computer  science  began  investigating  the  analysis 
of  on-line  algorithms  —  that  is,  algorithms  that  commit  to  actions  as  they  receive  events.  An  on-line 
problem  defines  the  types  of  events  and  actions  that  the  on-line  algorithm  can  use.  Computer  technology 
inspires  a  wide  variety  of  problems  that  fall  into  this  framework,  including  caching,  dynamic  lists,  real-time 
compression,  and  call  routing. 

Researchers soon became interested in abstractions to encompass a variety of on-line problems. Among
these were two very prominent problems: the metrical task system (MTS) problem [BLS92] and the
k-server problem [MMS90]. This thesis begins with the metrical task system problem (the simpler of the
two), where the algorithm is to decide in which state to process each of several sequential tasks, where
each task specifies the processing cost in each state, but changing states also has a cost according to a metric.
In  particular,  we  are  concerned  with  how  we  can  use  randomization  so  that  regardless  of  the  event  sequence, 
our  on-line  algorithm’s  expected  cost  is  not  too  many  times  the  optimal  cost  for  servicing  the  sequence. 

Independently, in the mid-1990s, machine learning researchers became interested in the
following  scenario:  The  on-line  learner  sees  a  sequence  of  examples  and  wants  to  predict  each  example’s 
label  before  seeing  the  true  label.  The  hope  is  that  the  learner  will  make  few  mistakes  as  it  sees  more  and 
more  examples  with  their  corresponding  labels.  This  is  termed  the  Experts  problem. 

This  thesis  demonstrates  how  one  particular  problem  arising  from  metrical  task  systems  is  intertwined 
with  another  particular  problem  arising  from  the  Experts  problem.  This  connection  forms  the  foundation 
of  this  dissertation,  on  which  we  build  four  primary  results.* 


*Most of the work appearing in this thesis originally appeared in papers by Blum and Burch [BB97]; Bartal, Blum, Burch, and
Tomkins [BBBT97]; and Blum, Burch, and Kalai [BBK99]. The author would like to recognize his coauthors, Avrim Blum, Yair
Bartal, Andrew Tomkins, and Adam Kalai, who share equally in the development of these concepts. Besides this chapter, Sections
2.1, 2.2, 3.1, and 3.2 describe background material to put this work in context.
A polylogarithmic MTS algorithm

Using on-line learning algorithms, we construct an algorithm guaranteeing that the ratio of its expected
cost to the optimal cost (were we to know the entire sequence in advance) is only polylogarithmic in the
number of states. In particular, our algorithm guarantees that the on-line algorithm's expected cost is no
more than $O(\log^7 n \log\log n)$ times the optimal cost knowing the sequence in advance. Using a much
less intuitive technique originating from more traditional on-line algorithms research, we can guarantee an
$O(\log^5 n \log\log n)$ bound. These represent the historically first polylogarithmic guarantees for the
metrical task system problem. (By refining these techniques further, Fiat and Mendel describe an algorithm
guaranteeing an expected cost of at most $O(\log^2 n \log^2\log n)$ times optimal [FM00].)

This result, and its incorporation of the concepts of metric space approximation, unfairness, and
connections to machine learning, forms the launching point of the dissertation. Understanding these concepts
and their connection to the metrical task system problem is the goal of the first part of the thesis,
Chapters 2-4.

In  Chapter  2,  we  learn  how  any  metric  space  can  be  approximated  by  what  are  called  HST  spaces,  a 
recent  result  from  Bartal  [Bar96,  Bar98].  We  also  view  a  generalized  form  of  MTS,  called  the  unfair  MTS 
problem,  that  allows  us  to  build  recursive  algorithms  for  HSTs.  This  analysis  indicates  what  sort  of  guarantee 
we want from our unfair MTS algorithm. We immediately see that this guarantee implies a substantial first
step toward the polylog(n) result.

Chapter  3  explains  the  machine  learning  problem  called  predicting  from  expert  advice  [LW94,  FS97]. 
This  problem  is  closely  related  to  the  unfair  MTS  problem,  as  we  demonstrate  by  taking  an  expert-advice 
algorithm  Share  and  using  it  for  the  unfair  MTS  problem  to  get  the  bound  desired  from  Chapter  2. 

Chapter 4 picks up from Chapter 2 again, describing an alternative algorithm Odd-Exponent achieving
this same bound, and showing how to use Odd-Exponent in a more complicated way to get the polylog(n)
ratio. We observe that the same techniques work with Share, although at the loss of an $O(\log^2 n)$ factor.

The  second  part  of  the  dissertation,  Chapters  5-7,  extends  the  concepts  for  the  polylogarithmic  bound 
(especially  the  connection  to  machine  learning)  to  get  the  other  three  main  results  of  the  thesis. 

Combining  on-line  algorithms 

Chapter 5 discusses a problem called combining on-line algorithms on-line, where we, as the on-line
algorithm, have a number of on-line algorithms which we might follow, but changing our current on-line
algorithm has a cost. These algorithms might, for example, incorporate a number of heuristics which do well
on particular event sequences, in case the actual event sequence matches one of our heuristics. Using
Experts results, we see how we can guarantee that our on-line combination algorithm can do almost as well as
the best of several on-line algorithms whose performance we can see.

We  also  see  how  an  on-line  algorithm  might  do  if  it  can  only  see  the  performance  of  its  current  heuristic. 
For  example,  this  might  happen  in  process  migration:  We  can  have  a  heuristic  for  each  computer,  telling  the 
process  to  stay  at  that  computer.  But  if  the  process  can  only  read  the  load  average  at  its  current  location, 
it  sees  only  its  current  heuristic's  performance.  Even  if  it  can  see  only  its  current  selection,  our  on-line 
algorithm  can  guarantee  that  it  does  not  pay  much  more  than  if  it  knew  in  advance  which  heuristic  pays 
least. 

Relating  metrical  task  systems  and  expert  advice 

In Chapter 6, we extend Chapter 3 by looking at the converse direction: using unfair MTS algorithms for
the expert advice problem. In particular, while Chapter 3 explains that some Experts algorithms also make
good unfair MTS algorithms, Chapter 6 proves that any algorithm with an MTS guarantee implies a similar
algorithm with an Experts guarantee.

                   q0  q1  q2  q3
$\mathbf{T}^1$ = (  3,  1,  1,  0 )
$\mathbf{T}^2$ = (  7,  0,  4,  3 )

Figure 1.1: A metric space and task sequence.

To  get  a  feel  for  the  variety  of  algorithms  this  implies  for  Experts,  we  look  at  the  results  of  a  small 
experiment  comparing  how  different  MTS  algorithms  perform  on  a  sample  of  process  migration  data. 

The  unfair  paging  problem 

The  final  direction  we  take  is  to  extend  the  notion  of  unfairness,  which  we  employed  in  our  analysis  of  the 
MTS  problem,  to  paging.  In  particular,  we  compare  the  on-line  algorithm’s  performance  against  the  cost  of 
servicing  the  request  sequence  if  we  increase  the  power  of  the  off-line  algorithm  by  allowing  it  to  ignore  the 
request at a cost of 1/r. We see an on-line algorithm that guarantees that it pays no more than $O(r + \log k)$
times  the  best  off-line  cost  computed  with  this  added  power.  (Here  k  represents  the  cache  size.) 

In  Chapter  7,  we  see  the  significance  of  the  problem  and  how  machine  learning  can  be  used  to  achieve 
improved  results  for  it.  Besides  the  significance  of  this  problem  to  paging,  this  work  can  also  be  seen  as 
a  first  effort  at  using  the  techniques  used  for  the  polylogarithmic  guarantee  for  metrical  task  systems  to 
achieve similar guarantees for the much more challenging k-server problem.


1.2  The  metrical  task  system  problem 

The  initial  problem  motivating  this  work,  and  a  major  focus  of  this  thesis,  is  the  metrical  task  system 
(MTS)  problem  due  to  Borodin,  Linial,  and  Saks,  designed  to  abstract  a  wide  variety  of  on-line  problems 
[BLS92]. 

Problem MTS ([BLS92]) We live in a system of n states with a distance metric d separating the
states. This distance metric is nonnegative ($d(u, v) \ge 0$), is symmetric ($d(u, v) = d(v, u)$), and has
the triangle inequality ($d(u, v) + d(v, w) \ge d(u, w)$). At all times we occupy a single state. At the
beginning  of  each  time  step,  we  receive  a  task  vector,  specifying  a  nonnegative  cost  for  each  state 
(representing  our  cost  if  we  process  the  task  in  that  state).  When  we  receive  a  task  vector  T,  we 
choose  whether  to  stay  at  our  current  state  or  to  move  to  a  different  state.  We  pay  both  for  moving 
between  states  (according  to  d)  and  for  processing  the  task  (according  to  T  at  our  new  state).  Our 
goal  is  to  minimize  our  total  cost  over  the  task  sequence. 

Example 1.1 Consider the metric d and task sequence $\mathbf{T}$ illustrated in Figure 1.1. On $\mathbf{T}^1$,† we may
choose to process the task in state $q_2$ and so pay $\mathbf{T}^1_2 = 1$ to process. Then say we choose to process
$\mathbf{T}^2$ in state $q_3$. We pay $d(q_2, q_3) = 5$ to move and $\mathbf{T}^2_3 = 3$ to process the task. Our total cost on this
sequence, then, is 1 + (5 + 3) = 9. (We have chosen sub-optimally: The optimal choice is to start
at $q_1$ and remain there, for a total cost of 1 + (0 + 0) = 1.)

†This dissertation uses superscripts not only for exponentiation but also for indexing time. To relieve ambiguity, time-indexed
variables appear in boldface.

The importance of metrical task systems lies in the fact that they generalize many natural on-line
problems. The following three examples illustrate this.

Example 1.2 Laptop computer power management inspires the following very simple task system.
The states are $q_0$, representing that the laptop's hard drive is not spinning, and $q_1$, representing that it
is. The distance between the states is half the amount of power required to begin spinning the disk.
(We use half because to be a metric the distance function must be symmetric. We are optimizing on
the total cost: Each time we move from $q_0$ to $q_1$, we will later move from $q_1$ to $q_0$; by using half
each time, we add the full amount to the total.) On all time steps, the cost to $q_1$ is the amount of
power to keep the disk spinning. For time steps where there is no disk access, the cost to $q_0$ is 0, but
when there is a disk access, the cost to $q_0$ is infinite to prevent an on-line player from being in $q_0$ for
the task. (Helmbold, Long, and Sherrod consider laptop disk management as a practical problem to
be approached using machine learning theory [HLS96]. We relate machine learning theory to task
systems in Chapter 3.)

Example  1.3  Say  we  have  a  computational  process  that  can  move  on  a  network  between  computers 
with  varying  loads.  In  metrical  task  systems,  the  costs  should  represent  the  quantity  we  want  to 
minimize,  and  in  this  case  we  want  to  avoid  lost  computation  time.  So  the  metric  gives  the  lost  time 
involved  in  transporting  the  process  from  one  computer  to  another.  And  on  each  time  step,  the  task 
vector  tells  us  for  each  computer  how  much  time  would  have  been  lost  were  we  at  that  computer. 
(Section  6.3  describes  an  experiment  comparing  different  MTS  algorithms  using  computer  load 
data.) 

Example 1.4 Paging can be formulated in the metrical task system framework. If we have a cache
that can hold k pages, and there are n pages in the universe, then the task system would include a
state for each of the $\binom{n}{k}$ choices of k pages from the universe. Our current state tells us what we
should hold in our cache. We represent a request to a page i as a task with a cost of 0 for those states
where i is in the state's corresponding cache and $\infty$ elsewhere. The distance between two states is
the number of page loads required to move between the two states' corresponding sets. (The MTS
results  in  this  thesis  unfortunately  say  nothing  useful  about  Paging,  as  the  number  of  states  is  much 
too  large  to  generate  useful  bounds.  But  Chapter  7  describes  how  the  techniques  used  for  the  MTS 
results  of  this  thesis  can  apply  to  Paging.) 
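
The construction in Example 1.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `paging_task_system` and `request_task` are hypothetical, pages are numbered 0 to n-1, and the distance is taken to be the number of pages present in the target cache but missing from the source cache.

```python
from itertools import combinations
from math import inf

def paging_task_system(n, k):
    """Build the MTS states and metric for paging with n pages and cache size k."""
    # One state per k-subset of the page universe (a possible cache content).
    states = [frozenset(c) for c in combinations(range(n), k)]
    # Moving from cache a to cache b requires loading the pages in b but not in a.
    # Because |a| == |b| == k, this count is symmetric.
    dist = [[len(b - a) for b in states] for a in states]
    return states, dist

def request_task(states, page):
    """Task vector for a request: cost 0 where the page is cached, infinite elsewhere."""
    return [0 if page in s else inf for s in states]
```

Even for small n and k the number of states grows as a binomial coefficient, which is the reason the example notes that MTS bounds stated in terms of the number of states are not useful for Paging.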

Some definitions will help us discuss task systems. An event sequence (or task sequence) $\mathbf{T}$ is the
time-indexed sequence of task vectors. An action sequence is a time-indexed sequence of states specifying where
each task is processed; in Example 1.1, the action sequence $\mathbf{v}$ is $(q_2, q_3)$. The movement cost $\mathit{move}(\mathbf{v})$ is
the total cost incurred according to the metric, $\sum_t d(\mathbf{v}^{t-1}, \mathbf{v}^t)$. The local cost (or task-processing cost)
$\mathit{local}(\mathbf{T}, \mathbf{v})$ is the total cost incurred according to the task vectors, $\sum_t \mathbf{T}^t_{\mathbf{v}^t}$. Thus the total cost
$\mathit{cost}(\mathbf{T}, \mathbf{v})$ for $\mathbf{v}$ on $\mathbf{T}$ is $\mathit{move}(\mathbf{v}) + \mathit{local}(\mathbf{T}, \mathbf{v})$.
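
As a concrete check of these definitions, the following Python sketch recomputes the two costs quoted in Example 1.1. The names `D`, `TASKS`, `move_cost`, `local_cost`, and `total_cost` are illustrative. Only the distance between q2 and q3 (which is 5) is taken from Example 1.1; every other entry of `D` is an assumed value chosen so that the matrix is a metric. The sketch also assumes, as Example 1.1 does, that the first state is chosen for free.

```python
# Distances between states q0..q3. Only D[2][3] = 5 comes from Example 1.1;
# the remaining entries are assumptions chosen to satisfy the metric axioms.
D = [
    [0, 4, 4, 5],
    [4, 0, 4, 5],
    [4, 4, 0, 5],
    [5, 5, 5, 0],
]
TASKS = [(3, 1, 1, 0), (7, 0, 4, 3)]  # T^1 and T^2 from Figure 1.1

def move_cost(d, actions):
    """Sum of metric distances between consecutive states (first state is free)."""
    return sum(d[u][v] for u, v in zip(actions, actions[1:]))

def local_cost(tasks, actions):
    """Sum of the task costs at the states where the tasks are processed."""
    return sum(task[v] for task, v in zip(tasks, actions))

def total_cost(d, tasks, actions):
    return move_cost(d, actions) + local_cost(tasks, actions)
```

With these assumptions, `total_cost(D, TASKS, [2, 3])` gives the 9 of Example 1.1, and `total_cost(D, TASKS, [1, 1])` gives the optimal cost of 1.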


1.3  Competitive  ratio 

In  the  MTS  problem,  as  with  other  on-line  problems,  the  competitive  ratio  proves  a  useful  performance 
measure  of  an  algorithm.  Informally,  this  is  the  maximum,  over  all  event  sequences  T,  of  the  ratio  of  the 
algorithm’s  cost  on  T  against  the  best  possible  cost  for  servicing  T.  In  Example  1.1,  this  ratio  is  9/1.  (But 
of  course,  since  we  looked  at  only  one  event  sequence  (and  not  all  possible  event  sequences),  this  is  not 
really a competitive ratio.) Sleator and Tarjan proposed this competitive ratio as a general technique for
analyzing  on-line  algorithm  performance  [ST85a]. 

Example  1.5  A  tourist  visiting  New  York  City  for  a  day  can  pay  $1.50  for  a  single  subway  trip  and 
$4.00  for  an  all-day  pass.  A  simple  strategy  employed  by  many  tourists  is  to  simply  buy  the  $4.00 
pass  at  the  first  subway  ride,  at  a  cost  of  $4.00.  This  has  a  poor  competitive  ratio,  since  if  it  is  also 
the  last  ride,  the  ratio  is  4/1.5  =  2.667.  An  alternative  strategy  is  to  buy  single-trip  tokens  for  the 
first  two  rides  and  the  all-day  pass  for  the  third.  For  this,  the  worst-case  ratio  is  7/4  =  1.75,  which 
occurs  if  the  tourist  takes  three  rides. 
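
The ratios in Example 1.5 can be reproduced with a short Python sketch. The function names and the cap `max_rides` are assumptions made for illustration; the prices are those of the example.

```python
SINGLE, PASS = 1.50, 4.00  # prices from Example 1.5

def strategy_cost(switch_ride, rides):
    """Cost of paying per trip until ride number `switch_ride`, then buying the pass."""
    if rides < switch_ride:
        return SINGLE * rides
    return SINGLE * (switch_ride - 1) + PASS

def optimal_cost(rides):
    """Best cost in hindsight: either all single trips or the pass from the start."""
    return min(SINGLE * rides, PASS)

def worst_ratio(switch_ride, max_rides=20):
    """Largest ratio of strategy cost to optimal cost over 1..max_rides rides."""
    return max(strategy_cost(switch_ride, r) / optimal_cost(r)
               for r in range(1, max_rides + 1))
```

Here `worst_ratio(1)` corresponds to buying the pass immediately (ratio 4/1.5) and `worst_ratio(3)` to buying it on the third ride (ratio 7/4).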

Example  1.5  illustrates  that  the  competitive  ratio  is  not  always  the  most  intuitive  way  of  looking  at  the 
problem.  If  our  tourist  were  quite  sure  she  would  use  the  subway  more  than  twice,  perhaps  she  should 
have  bought  the  all-day  pass  initially.  Or  if  our  tourist  brought  only  $5.00,  she  may  want  the  all-day  pass. 
The advantage of the competitive ratio bound is that it applies to many on-line problems without
requiring additional inputs (like a probability distribution) to the problem. Additionally, theoretical
comparisons using competitive ratios often agree with empirical comparisons in how they rank algorithms.
(Empirically, the ratios tend to be much lower since inputs generally are not adversarial.)

Additional research refined the notion of competitive ratio slightly to incorporate randomization and to
provide an additive fudge factor. We say randomized algorithm A is $\rho$-competitive if for any task sequence,
the expected cost to A is at most $\rho$ times the best achievable cost for the task sequence (plus a constant
independent of the sequence). More formally, given a metric space d, an on-line algorithm A has competitive
ratio $\rho$ if for some constant b, for each event sequence $\mathbf{T}$, A outputs an action sequence $\mathbf{v}_A$ (a random variable
if A is randomized) so that for all action sequences $\mathbf{v}$, the cost to A obeys the inequality

$$E[\mathit{cost}(\mathbf{T}, \mathbf{v}_A)] \le \rho \, \mathit{cost}(\mathbf{T}, \mathbf{v}) + b . \qquad (1.1)$$

The additive part b proves to be an important (and irritating) detail. Thus we frequently speak of A as
having “ratio $\rho$ with additive b.”

The way the quantifiers are ordered in this definition assumes an oblivious adversary; an adversary
choosing the worst-case $\mathbf{T}$ must choose the entire sequence without knowing A's particular choices. This is
appropriate in circumstances where the algorithm has a negligible effect on the environment, such as in
paging (usually) and in small-quantity stock investing. An alternative is to use an adaptive adversary who
can choose each task vector knowing A's random choices so far [BDBK+94]. But throughout this thesis we
use an oblivious adversary for all our on-line problems.

One very nice aspect of analyzing algorithms against oblivious adversaries is the simplicity of expressing
the cost in the uniform metric (where all interstate distances are 1). If $\mathbf{p}^{t-1}$ is our current probability
distribution, and we move to distribution $\mathbf{p}^t$ in order to process the task $\mathbf{T}^t$, define $d(\mathbf{p}^{t-1}, \mathbf{p}^t)$ to be

$$\sum_{i \,:\, \mathbf{p}_i^{t-1} > \mathbf{p}_i^t} \left( \mathbf{p}_i^{t-1} - \mathbf{p}_i^t \right) = \sum_{i \,:\, \mathbf{p}_i^{t-1} < \mathbf{p}_i^t} \left( \mathbf{p}_i^t - \mathbf{p}_i^{t-1} \right) .$$

(Since both $\mathbf{p}^{t-1}$ and $\mathbf{p}^t$ are probability distributions and so sum to 1, these quantities are equal.) For the task
$\mathbf{T}^t$, our expected cost is exactly $d(\mathbf{p}^{t-1}, \mathbf{p}^t) + \sum_i \mathbf{p}_i^t \mathbf{T}_i^t$. It is convenient to think of probability as a fluid
being moved between states as time progresses, where the movement cost between time steps is the amount
of fluid being transferred. Indeed, we can redefine the MTS problem as the on-line algorithm choosing a
probability distribution, with the costs as just described, and so avoid the intricacies of probability altogether.

Theorem 1.1 Say we are in the uniform metric. We can change our state probability distribution
from $\mathbf{p}^{t-1}$ to $\mathbf{p}^t$ at an expected cost of $d(\mathbf{p}^{t-1}, \mathbf{p}^t)$.

Proof. If the probability we are at our actual current state i should increase (i.e., $\mathbf{p}_i^{t-1} \le \mathbf{p}_i^t$),
then we do not move. But if that probability decreases ($\mathbf{p}_i^{t-1} > \mathbf{p}_i^t$), then we remain at i with
probability $\mathbf{p}_i^t / \mathbf{p}_i^{t-1}$ and otherwise choose randomly from among the states j whose probabilities
increase, choosing with probabilities $(\mathbf{p}_j^t - \mathbf{p}_j^{t-1}) / d(\mathbf{p}^{t-1}, \mathbf{p}^t)$.

The new probability distribution with this strategy is $\mathbf{p}^t$. For decreasing-probability states i, the
probability we are there is the product of the chance we were already there ($\mathbf{p}_i^{t-1}$) and the chance
we remain there given we were already there ($\mathbf{p}_i^t / \mathbf{p}_i^{t-1}$), and this product is $\mathbf{p}_i^t$. There is no chance
that we move to i. For increasing-probability states i, we are there if we move to i or if we were
at i already. The probability we move there from a decreasing-probability state j is the product of
the chance we were at j (which is $\mathbf{p}_j^{t-1}$), the chance we move from j given we were there (which
is $(\mathbf{p}_j^{t-1} - \mathbf{p}_j^t) / \mathbf{p}_j^{t-1}$), and the chance we move to i given that we are moving from j (which is
$(\mathbf{p}_i^t - \mathbf{p}_i^{t-1}) / d(\mathbf{p}^{t-1}, \mathbf{p}^t)$). This product is $(\mathbf{p}_j^{t-1} - \mathbf{p}_j^t)(\mathbf{p}_i^t - \mathbf{p}_i^{t-1}) / d(\mathbf{p}^{t-1}, \mathbf{p}^t)$. Summing
over all such j gives us $\mathbf{p}_i^t - \mathbf{p}_i^{t-1}$. We could also have already been at state i (and remained there)
with probability $\mathbf{p}_i^{t-1}$, for a total probability of $\mathbf{p}_i^t$.

To get the total probability we move, we sum the chances that we move to each state. For
decreasing-probability states, this chance is 0. For increasing-probability states i, we have already
seen that the chance we move there is $\mathbf{p}_i^t - \mathbf{p}_i^{t-1}$. Summing over all states gives us $d(\mathbf{p}^{t-1}, \mathbf{p}^t)$. ■
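
The movement rule in the proof of Theorem 1.1 can be checked numerically. The sketch below is an illustration only, with hypothetical names `fluid_distance` and `transition_matrix`: it builds the transition probabilities described in the proof and lets us confirm, with exact rational arithmetic, that they turn the old distribution into the new one while moving with total probability equal to the fluid distance.

```python
from fractions import Fraction

def fluid_distance(p_old, p_new):
    """Total probability mass leaving the states whose probability decreases."""
    return sum(a - b for a, b in zip(p_old, p_new) if a > b)

def transition_matrix(p_old, p_new):
    """M[j][i] = chance of ending at state i given that we currently occupy state j."""
    n = len(p_old)
    dist = fluid_distance(p_old, p_new)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        if p_old[j] <= p_new[j]:
            M[j][j] = Fraction(1)          # probability does not decrease: stay put
            continue
        stay = p_new[j] / p_old[j]         # remain with probability p_new / p_old
        M[j][j] = stay
        for i in range(n):
            if p_new[i] > p_old[i]:
                # Otherwise move to an increasing state i, proportionally to its gain.
                M[j][i] = (1 - stay) * (p_new[i] - p_old[i]) / dist
    return M

def apply_transition(p_old, M):
    """Distribution over states after one step of the transition M."""
    n = len(p_old)
    return [sum(p_old[j] * M[j][i] for j in range(n)) for i in range(n)]

def move_probability(p_old, M):
    """Probability that the state actually changes (the expected cost in the uniform metric)."""
    return sum(p_old[j] * (1 - M[j][j]) for j in range(len(p_old)))
```

For example, with old distribution (1/2, 1/4, 1/4, 0) and new distribution (1/8, 1/4, 3/8, 1/4), the fluid distance is 3/8, and the rule moves with exactly that probability.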

A  major  open  problem  in  competitive  analysis  is,  “How  small  a  competitive  ratio  can  one  guarantee 
for  metrical  task  systems  on  arbitrary  distance  metrics?”  A  primary  goal  of  this  dissertation  is  to  present  a 
substantially  improved  answer  to  this  question. 


1.4  Previous  results 

Uniform  metric 

The simplest, most important, and best-understood metric for task systems is the uniform metric, where
$d(u, v) = 1$ when $u \ne v$ (and $d(v, v) = 0$ for all v).

The  Marking  algorithm  of  Borodin,  Linial,  and  Saks  is  a  simple  and  useful  algorithm  for  the  uniform 
metric  [BLS92].  (This  algorithm  is  similar  to  the  Marking  algorithm  used  for  Paging,  which  we  review  in 
Chapter  7  [FKL+91].) 

Algorithm  Marking  ([BLS92])  The  algorithm  proceeds  in  phases.  At  the  beginning  of  each  phase 
all  states  are  unmarked,  and  Marking  chooses  a  uniform-random  state  to  occupy.  As  tasks  are 
received,  Marking  increases  counters  on  each  state,  keeping  track  of  the  total  processing  cost  for 
the  state  in  this  phase.  (This  counter  will  increase  when  the  state  incurs  a  cost,  whether  or  not  the 
algorithm  occupies  it.)  When  a  state’s  counter  reaches  1,  we  say  that  this  state  is  marked.  When  the 
current  state  becomes  marked,  the  algorithm  moves  to  a  random  unmarked  state.  When  all  states  are 
marked,  Marking  resets  all  marks  and  counters  and  begins  a  new  phase. 

Example 1.6 Consider 3 states $q_0$, $q_1$, and $q_2$, where Marking begins at $q_0$, with the task sequence

                   q0    q1    q2
$\mathbf{T}^1$ = ( 0.5,  0.2,  0.0 )
$\mathbf{T}^2$ = ( 0.2,  0.3,  2.0 )
$\mathbf{T}^3$ = ( 0.0,  1.0,  1.0 )
$\mathbf{T}^4$ = ( 1.0,  0.0,  0.0 )

Marking initially chooses a random state; say it chooses $q_1$ and so pays 0.2 for $\mathbf{T}^1$. The counters
are now (0.5, 0.2, 0). On $\mathbf{T}^2$, the counters become (0.7, 0.5, 2); $q_2$ is now marked, but Marking is
at $q_1$ and so remains there, at a cost of 0.3. On $\mathbf{T}^3$, the counters become (0.7, 1.5, 3). The current
state $q_1$ is now marked; the algorithm chooses randomly from the unmarked states $\{q_0\}$, so Marking
must choose $q_0$, at a cost of 1 + 0. On $\mathbf{T}^4$, all states become marked; Marking clears all counters
and chooses a random state, say $q_2$. The cost is 1 + 0; Marking's total cost for these four tasks is
0.2 + (0 + 0.3) + (1 + 0) + (1 + 0) = 2.5.
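
A minimal Python sketch of Marking, following the description of the algorithm above, is given here for illustration. Several details are assumptions rather than part of the source: the function name `marking`, the `choose` hook (added so that the random choices can be scripted), the free initial placement, and the decision not to charge the task that ends a phase to the counters of the new phase.

```python
import random

def marking(tasks, n, choose=random.choice):
    """Run Marking on n states in the uniform metric and return its total cost.

    `choose` picks one element from a non-empty list; it defaults to a
    uniform-random choice and can be replaced to script the choices.
    """
    counters = [0.0] * n
    state = choose(list(range(n)))      # uniform-random state for the first phase
    total = 0.0
    for task in tasks:
        # Every state's counter grows, whether or not we occupy that state.
        counters = [c + cost for c, cost in zip(counters, task)]
        unmarked = [s for s in range(n) if counters[s] < 1]
        new_state = state
        if not unmarked:
            # All states are marked: clear the counters and start a new phase.
            counters = [0.0] * n
            new_state = choose(list(range(n)))
        elif counters[state] >= 1:
            # Our state has become marked: move to a random unmarked state.
            new_state = choose(unmarked)
        if new_state != state:
            total += 1                  # every move costs 1 in the uniform metric
        total += task[new_state]        # process the task in the state we end in
        state = new_state
    return total
```

Scripting the choices as $q_1$, then $q_0$, then $q_2$ on the task sequence of Example 1.6 reproduces the total cost of 2.5 computed there.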

The following theorem bounds the competitive ratio of Marking. Achlioptas, Chrobak, and Noga
demonstrate the best possible bound for Marking, $2H_n - 1$ [ACN96].‡

Theorem 1.2 ([BLS92]) Marking has competitive ratio $2H_n$ for uniform metric spaces.

Proof. We analyze by phases. Any action sequence taken by an off-line algorithm must pay at least
1 in each phase (either 1 to move or 1 if it stays in the same state); we argue that Marking's expected
cost is at most $2H_n$. Consider the first state to become marked. The probability that Marking ever
goes to this state during the phase is $\frac{1}{n}$, and if so then Marking pays at most 2 for this state (at most
1 to move there, and at most 1 in local costs). Thus the expected cost to Marking at this state is at
most $\frac{2}{n}$. Now consider the second state to become marked. The probability that Marking ever goes
to this state is at most $\frac{1}{n-1}$, and if so then Marking pays at most 2 for this; thus the expected cost
to Marking at this state is at most $\frac{2}{n-1}$. Generally, at the ith state to become marked in the phase,
Marking expects to pay at most $\frac{2}{n-i+1}$ at that state. We sum over all states to get $2H_n$. ■
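
Written out, the final summation in this proof is a reindexing of the harmonic sum (substituting $j = n - i + 1$):

```latex
\sum_{i=1}^{n} \frac{2}{n-i+1}
  \;=\; 2 \sum_{j=1}^{n} \frac{1}{j}
  \;=\; 2 H_n .
```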

On the other side, we know that no algorithm can guarantee a competitive ratio of less than $H_n$. Irani
and Seiden nearly match this lower bound with an algorithm achieving the ratio $H_n + O(\sqrt{\log n})$ [IS98].

Theorem  1.3  ([BLS92])  Every  on-line  algorithm  for  the  uniform  metric  has  a  competitive  ratio  of 
at  least  Hn. 

Proof. Consider the following sequence constructed by an adversary who maintains the probability
distribution on states used by the on-line algorithm A. The sequence proceeds in phases. The first
task vector of the phase is 0 on all but the most-probable state $q_1$, where it is infinite. Since A is at $q_1$
with probability at least $\frac{1}{n}$, and it will pay 1 to move from $q_1$ to avoid the infinite cost, the expected
cost to A is at least $\frac{1}{n}$. The second task vector is 0 everywhere except for $q_1$ and the most-probable
state $q_2$. The expected cost on this task is at least $\frac{1}{n-1}$. We continue this until we reach n − 1 tasks,
each time using task vectors that are 0 everywhere except at $q_1, \ldots, q_{i-1}$ and the most-probable
state $q_i$. The total cost to A after these tasks is at least $H_n - 1$. For the final task vector of this phase,
we give a cost of 1 to the remaining state $q_n$ and 0 elsewhere; since A must be at $q_n$, the cost is 1 for
a total cost of at least $H_n$ to A.

An off-line algorithm knowing the sequence would be at $q_n$ for the first n − 1 tasks, at no cost;
on the nth task, it would move to the next phase's $q_n$, at a cost of 1. Since the adversary can repeat
these phases indefinitely, the competitive ratio of A is at least $H_n$. ■

General  metrics 

The situation for arbitrary metrics is more challenging. In the metric space of Figure 1.2, for example,
Marking does very poorly: it will likely pay at least 100 in most phases. A more promising alternative for
metric spaces like that of Figure 1.2 is to merge $q_0$ and $q_1$ somehow and to combine this $q_0$-$q_1$ combination
with $q_2$ using some algorithm like Marking; that is, to use Marking to combine $q_0$ and $q_1$ in isolation,
and then to use Marking again to incorporate $q_2$ into the mixture. Karlin et al. consider the case of such
an unbalanced 3-point space [KMMO90]; for larger unbalanced spaces, Blum et al. apply this principle of
building from algorithms for subspaces [BKRS92]. This decomposition of a space into subspaces is also the
inspiration behind the approach followed in this dissertation.

Figure 1.2: A decidedly nonuniform metric space.

‡By $H_n$, we mean the nth harmonic number, $\sum_{i=1}^{n} \frac{1}{i}$.

Many of the known algorithms, including many seen in this dissertation, use the work function. The
work function $\mathbf{OPT}^t_v$, indexed by a time t and a state v, represents the optimal off-line cost for servicing
the first t tasks and ending in state v. We can compute $\mathbf{OPT}^t_v$ as follows. Initially $\mathbf{OPT}^0_v$ is 0 for all v.
Given a task vector $\mathbf{T}^t$ we update each state's work function to

$$\mathbf{OPT}^t_v = \min_u \left( \mathbf{OPT}^{t-1}_u + \mathbf{T}^t_u + d(u, v) \right) .$$

Notice that $\mathbf{OPT}^t_u$ and $\mathbf{OPT}^t_v$ can never differ by more than $d(u, v)$. We say that state u pins state v when
$\mathbf{OPT}^t_v = \mathbf{OPT}^t_u + d(u, v)$.

Besides  introducing  the  problem  and  presenting  Marking,  Borodin,  Linial,  and  Saks  also  demonstrate  a 
deterministic  algorithm  for  general  metric  spaces. 

Algorithm  Work-Function  ([BLS92])  We  maintain  the  work  function.  When  the  state  we  occupy 
becomes  pinned,  we  move  to  the  pinning  state. 

Example 1.7 We return to Example 1.1. The work function values are initially $\mathbf{OPT}^0 = (0, 0, 0, 0)$.
We initially occupy state $q_0$, and receive $\mathbf{T}^1 = (3, 1, 1, 0)$. We update our work function values
to $\mathbf{OPT}^1 = (3, 1, 1, 0)$. Nobody yet pins state $q_0$, so we remain there, at a cost of 0 to move
and 3 to process. Our second task vector $\mathbf{T}^2$ is $(7, 0, 4, 3)$, so our work function values become
$\mathbf{OPT}^2 = (5, 1, 5, 3)$. Now state $q_1$ pins states $q_0$ and $q_2$. We are at state $q_0$, so we move to the
pinning state, $q_1$, at a cost of 4 to move and 0 to process. Our total cost on this sequence, then, is
3 + (4 + 0) = 7.

Borodin,  Linial,  and  Saks  show  the  following,  not  proven  in  this  thesis. 

Theorem 1.4 ([BLS92]) Work-Function has competitive ratio $2n - 1$ for any metric space.

They complement this by showing that deterministic algorithms cannot guarantee less than $2n - 1$.

How much better can one do with randomized algorithms? This remains a major open question in
competitive analysis. It was not even clear that any improvement was possible until Irani and Seiden
demonstrated a randomized algorithm with a mildly improved competitive ratio, $1.58n - 0.58$ [IS98]. On the
lower-bound front, Blum et al. show that regardless of the metric, every algorithm must have a competitive
ratio of at least $\Omega(\sqrt{\log n / \log\log n})$ [BKRS92].

In the absence of any satisfying bounds closing this gap for arbitrary metrics, researchers developed
algorithms for some natural metrics beyond the uniform metric. These include an $O(\log n)$ ratio for “highly
unbalanced spaces” [BKRS92], an $O(\log^2 n)$ ratio for a star space [Tom97], and a $2^{O(\sqrt{\log n \log\log n})}$ ratio
for equally-spaced points on a line [BBF+90, BRS97]. (In a star space, $d(u, v)$ is $d_u + d_v$ for some choices
$d_v$ of values for states.)

These examples in other metrics led to the somewhat daring conjecture that a general algorithm exists
achieving $O(\log n)$ on every metric, and that no metric exists where $o(\log n)$ is possible. This $O(\log n)$
algorithm remains elusive, but an algorithm, presented in this dissertation, achieves ratio $O(\log^5 n \log\log n)$.
Fiat and Mendel subsequently refine this to $O(\log^2 n \log^2 \log n)$ [FM00]. These polylogarithmic guarantees,
coupled with the $\Omega(\sqrt{\log n / \log\log n})$ lower-bound result of Blum et al. [BKRS92], give strong evidence
for the randomized MTS conjecture.

Chapter 2

HST  approximation 


Bartal’s probabilistic approximation of arbitrary metric spaces with h-HSTs is a major new tool in
optimization  algorithm  research  [Bar96,  Bar98].  The  MTS  problem  was  a  major  motivation  behind  this 
result,  and  the  MTS  result  presented  in  this  dissertation  remains  an  important  application.  In  this  chapter 
we  explore  this  result  and  its  application  to  the  MTS  problem. 


2.1  Probabilistic  approximation 


The notion of probabilistic approximation dates from Karp [Kar90]. A metric space d is probabilistically
approximated with ratio p by a class C of metric spaces with an associated distribution if, for every pair of
points u and v in d,

1. For all metrics $\hat d \in C$, we have $\hat d(u, v) \ge d(u, v)$.

2. $\mathbf{E}_{\hat d \in C}\big[\hat d(u, v)\big] \le p \cdot d(u, v)$.

That is, every edge expands (regardless of our choice of $\hat d$, no edge becomes shorter than in d) but its
expected expansion factor is not more than p.

Example 2.1 Karp uses a simple example of probabilistically 2-approximating an n-node cycle
space by a set of n-node line spaces: Choose a random edge of the cycle and split it there. (See
Figure 2.1.) No matter which edge we pick, no distance shrinks using this approximation. But for
any adjacent pair of nodes u and v, the edge connecting them is split with probability $\frac{1}{n}$; otherwise
it remains intact. Thus the expected distance is

\[ \mathbf{E}_{\hat d}\big[\hat d(u, v)\big] \le \left(1 - \frac{1}{n}\right) \cdot 1 + \frac{1}{n}(n - 1) = 2 - \frac{2}{n} < 2 . \]

For any nonadjacent pair of nodes, their expected distance in $\hat d$ is at most the sum of the expected
lengths along edges in the shortest path between them, and we know that these edges expand by
$2 - \frac{2}{n}$ in expectation.

Figure 2.1: Approximating a cycle by a line.
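The cycle-to-line approximation is small enough to check exhaustively. The sketch below assumes unit-length edges and nodes numbered 0 to n - 1 around the cycle; the function names are hypothetical, and exact rational arithmetic is used so that the expected distances can be compared without rounding.

```python
from fractions import Fraction


def cycle_dist(n, u, v):
    """Shortest-path distance on an n-node cycle with unit edges."""
    k = abs(u - v)
    return min(k, n - k)


def line_dist(n, cut, u, v):
    """Distance on the line obtained by removing the edge (cut, cut + 1 mod n)."""
    start = (cut + 1) % n  # first node of the resulting line
    return abs((u - start) % n - (v - start) % n)


def expected_line_dist(n, u, v):
    """Expected line distance when the removed edge is chosen uniformly."""
    return Fraction(sum(line_dist(n, cut, u, v) for cut in range(n)), n)
```

For n = 6 an adjacent pair has expected distance 5/3, which equals 2 - 2/n, and no pair expands by more than that factor in expectation.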

The following straightforward theorem relates the concept of probabilistic approximation to the MTS
problem. Coupled with Example 2.1, for example, it says that an MTS algorithm that is r-competitive on
line spaces implies a 2r-competitive algorithm for cycle spaces.

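Theorem 2.1 Say that we can probabilistically p-approximate a metric space d with a distribution
on a class C of metric spaces, and say we can find an r-competitive MTS algorithm $\hat A$ for metrics
from C. Then we have an (rp)-competitive algorithm A for d.

Proof. Our algorithm A probabilistically approximates d by a metric $\hat d \in C$ and then runs $\hat A$ on $\hat d$
using the identical task sequence. On each step i, A chooses to occupy whichever state $\hat A$
occupies within $\hat d$.

Consider any action sequence v in d. Let $\mathbf{E}_{\hat d}[\cdots]$ represent the expected cost of A relative to its
choice of $\hat d$, and let $\mathbf{E}_{\hat A}[\cdots]$ represent the expected cost of $\hat A$ given the choice of $\hat d$. Writing $v_A$ for
the action sequence that A follows, the expected cost to A is

\[ \mathbf{E}_{\hat d}\Big[\mathbf{E}_{\hat A}\Big[\sum_i \big(d(v_A^{i-1}, v_A^i) + T^i_{v_A^i}\big)\Big]\Big]
   \le \mathbf{E}_{\hat d}\Big[\mathbf{E}_{\hat A}\Big[\sum_i \big(\hat d(v_A^{i-1}, v_A^i) + T^i_{v_A^i}\big)\Big]\Big] . \]

(The inequality holds because $d(u, v) \le \hat d(u, v)$ necessarily.) The amount inside $\mathbf{E}_{\hat d}[\cdots]$ on the right
is exactly the expected cost to $\hat A$ on $\hat d$. Using the fact that $\hat A$ is r-competitive, we continue.

\begin{align*}
\mathbf{E}_{\hat d}\Big[\mathbf{E}_{\hat A}\Big[\sum_i \big(\hat d(v_A^{i-1}, v_A^i) + T^i_{v_A^i}\big)\Big]\Big]
  &\le \mathbf{E}_{\hat d}\Big[r \sum_i \big(\hat d(v^{i-1}, v^i) + T^i_{v^i}\big) + b\Big] \\
  &= r \sum_i \Big(\mathbf{E}_{\hat d}\big[\hat d(v^{i-1}, v^i)\big] + T^i_{v^i}\Big) + b \\
  &\le r \sum_i \big(p\, d(v^{i-1}, v^i) + T^i_{v^i}\big) + b \\
  &\le rp \sum_i \big(d(v^{i-1}, v^i) + T^i_{v^i}\big) + b
\end{align*}

Since this inequality holds for any sequence v, A is (rp)-competitive.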


2.2  Approximation  with  HSTs 

Bartal’s contribution is to develop a technique for approximating arbitrary metrics by a special type of space
particularly amenable to constructing algorithms, the h-hierarchical well-separated tree (h-HST). Define
the diameter of a metric space to be the maximum distance separating any two points in it. A metric
space with diameter Δ is an h-HST metric if it can be partitioned into subspaces that are recursively h-HST
metrics with diameters at most Δ/h, where the distance between any two points in different subspaces is
Δ.* The easiest way to draw an h-HST is as a tree; see Figure 2.2. In this drawing, the distance between
the second and fifth points from the left is 2, since this is the diameter of the lowest subtree containing
both points.

Figure 2.2: An example of a 2-HST. (The circles are points, and the numbers indicate diameters of subtrees.)
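The tree view of an h-HST translates directly into a distance computation. In the sketch below the representation is an assumption made for illustration: an internal node is a pair `(diameter, children)`, a point is a string label, and `EXAMPLE_TREE` is a hypothetical tree, not the one in Figure 2.2.

```python
def points(tree):
    """Set of point labels in a subtree; a bare string is a single point."""
    if isinstance(tree, str):
        return {tree}
    _, children = tree
    found = set()
    for child in children:
        found |= points(child)
    return found


def hst_distance(tree, u, v):
    """Diameter of the lowest subtree containing both u and v."""
    if u == v:
        return 0
    diameter, children = tree
    for child in children:
        inside = points(child)
        if u in inside and v in inside:
            return hst_distance(child, u, v)
    return diameter  # u and v lie in different children of this node


def is_h_hst(tree, h):
    """Check that every child subtree has diameter at most diameter / h."""
    if isinstance(tree, str):
        return True
    diameter, children = tree
    return all(
        (0 if isinstance(c, str) else c[0]) <= diameter / h and is_h_hst(c, h)
        for c in children
    )


# Hypothetical 2-HST on five points.
EXAMPLE_TREE = (4, [(2, ["a", "b"]), (2, ["c", (1, ["d", "e"])])])
```

In `EXAMPLE_TREE`, points a and c are at distance 4 because only the root contains both, while d and e are at distance 1. The tree passes the check for h = 2 but not for h = 3.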

Theorem 2.2 ([Bar98]) For any h > 1, any metric space of n nodes can be probabilistically
approximated with ratio $O(h \log n \log\log n)$ by a distribution on h-HSTs.

Some of our less sophisticated results rely on the number of levels in the h-HST; in this theorem, the depth
of each tree is $O(\log_h \Delta)$, where Δ is the ratio of the longest distance to the shortest nonzero distance in d.

This  theorem  has  many  applications  to  approximation  algorithms  and  on-line  algorithms.  For  many  of 
these  cases,  the  value  of  h  is  irrelevant  and  so  h  is  taken  to  be  simply  1.  But  in  the  MTS  result  we  will  find 
it necessary to take h to be a larger value (like $O(\log n)$).

Rather than look at the proof of Theorem 2.2, for intuition we look at a simplified result applying only
to $\ell_\infty$ metrics, and then we briefly discuss how the same approach applies to arbitrary $\ell_p$ metrics. (In an
$\ell_\infty$ space, points have coordinates, and the distance d(u, v) between two points u and v is $\max_i |u_i - v_i|$,
where $u_i$ is the ith coordinate of point u.)

Theorem 2.3 For any h > 1, any k-dimensional $\ell_\infty$ space of n nodes can be probabilistically
approximated with ratio $O(hk \log_h n)$ by h-HSTs.

Algorithm Approx-$\ell_\infty$ Say our metric space d has diameter D. We construct our h-HST by
selecting, for each dimension, a partition of the axis into pieces of width $\frac{D}{h}$. Independently for each axis,
we choose the offset of this partition by choosing a number uniformly from $[0, \frac{D}{h}]$ so that no pair
of nodes $u, v \in d$ with $d(u, v) < \frac{D}{n^2 h}$ is divided. (That is, we continue choosing new offsets until
no such pair is split by our choice. Finding such a partition is always possible; there are at most $\frac{n^2}{2}$
pairs of points, so at most $\frac{D}{2h}$ of the range $[0, \frac{D}{h}]$ is disallowed.) This produces a partition
of the k-dimensional space into at most $(h + 1)^k$ nonempty regions, which we call divisions. Our
h-HST will have a recursively-computed subspace for each division. We choose the diameter (that
is, the distance between points in different divisions) to be D. Because each division has diameter
at most $\frac{D}{h}$ (and so the recursively-computed subspace has diameter at most $\frac{D}{h}$), we get an h-HST.
Figure 2.3 illustrates this technique on a 2-dimensional space with h = 2.
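The recursive construction can be sketched as follows. This is an illustration under several assumptions that are not from the text: points are distinct tuples of floats, h > 1, the "too close to split" threshold is taken to be D / (n^2 h) with n fixed at the size of the whole space, and offsets are found by simple rejection sampling. The names `Node`, `approx_linf`, `leaves` and `tree_distance` are hypothetical.

```python
import itertools
import math
import random
from dataclasses import dataclass


@dataclass
class Node:
    diameter: float
    children: list  # each child is a Node or a point (tuple of coordinates)


def linf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def approx_linf(points, h, n=None, rng=random):
    """Build a random h-HST over distinct points; assumes h > 1."""
    if n is None:
        n = len(points)  # n stays fixed at the size of the whole space
    if len(points) == 1:
        return points[0]
    D = max(linf(u, v) for u, v in itertools.combinations(points, 2))
    width = D / h
    # Pairs closer than D / (n^2 h) must not be separated at this level.
    protected = [(u, v) for u, v in itertools.combinations(points, 2)
                 if linf(u, v) < D / (n * n * h)]
    offsets = []
    for axis in range(len(points[0])):
        while True:  # resample until no protected pair is split on this axis
            off = rng.uniform(0, width)
            if all(math.floor((u[axis] - off) / width)
                   == math.floor((v[axis] - off) / width)
                   for u, v in protected):
                offsets.append(off)
                break
    divisions = {}
    for p in points:
        cell = tuple(math.floor((x - off) / width)
                     for x, off in zip(p, offsets))
        divisions.setdefault(cell, []).append(p)
    return Node(D, [approx_linf(group, h, n, rng)
                    for group in divisions.values()])


def leaves(tree):
    if not isinstance(tree, Node):
        return [tree]
    return [p for child in tree.children for p in leaves(child)]


def tree_distance(tree, u, v):
    """Diameter of the lowest subtree containing both u and v."""
    if u == v:
        return 0
    for child in tree.children:
        inside = leaves(child)
        if u in inside and v in inside:
            return tree_distance(child, u, v)
    return tree.diameter
```

Whatever offsets are drawn, the tree distance never falls below the original distance, because two points are separated at a node whose diameter is the largest distance in a set containing both. The recursion terminates because two points at distance D differ by more than one cell width on some axis, so every level produces at least two divisions.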

Proof. Consider any pair of nodes u and v in our original space. This pair will be separated on some
level of the tree; since the diameter D on that level is at least d(u, v), we satisfy the first requirement
of a probabilistic approximation, $d(u, v) \le D = \hat d(u, v)$. Now we consider the upper bound on the
expected $\hat d(u, v)$. The nodes u and v will be split on a level of the recursion where the diameter is
between $d(u, v)$ and $n^2 h\, d(u, v)$. There are at most $1 + \log_h(n^2 h) = O(\log_h n)$ of these. For a level
of recursion with a diameter D, for each coordinate the probability that the partition splits u and v is
at most $\frac{2h\, d(u, v)}{D}$, and in this case $\hat d(u, v)$ is D. So the expected contribution to the distance is at most
$2h\, d(u, v)$. We sum over all coordinates to get $2hk\, d(u, v)$, and sum over all $O(\log_h n)$ levels to get

\[ \mathbf{E}\big[\hat d(u, v)\big] \le O(hk \log_h n)\, d(u, v) . \]

Figure 2.3: Constructing a 2-HST for an $\ell_\infty$ space. (Circles are points; on the left, distances are based on
the two-dimensional coordinates in the diagram, and lines represent the partitions.)

*Bartal's definition of the distance between two points u and v is different: Whereas we define it to be the diameter of the lowest
subspace containing u and v, he defines it as the sum of this “diameter” and half the sum of the “diameters” of the subspaces in
each lower level containing u or v [Bar96]. (Bartal's definition comes from mapping the space to a tree with lengths assigned to
the edges and points at the leaves. The distance from u to v is the sum of edge lengths on the path from u to v in the tree.) Since
we always use h > 2, the two definitions differ by only a constant factor.


This approach generalizes naturally to arbitrary $\ell_p$ metrics.

Theorem 2.4 For any h > 1 and integer p > 1, any k-dimensional $\ell_p$ metric space of n nodes can
be probabilistically $O(hk \log_h(n k^{1/p}))$-approximated by h-HSTs.

Proof. We follow the method of Theorem 2.3, with a few differences. When the diameter is D,
we partition each axis into pieces of width $\frac{D}{h k^{1/p}}$, so that the diameter of each division is at most $\frac{D}{h}$, but we
choose the offset so that no point pair (u, v) with $d(u, v) < \frac{D}{n^2 h k^{1/p}}$ is separated. Consider any
pair of points u and v. For each coordinate i, let $\ell_i = |u_i - v_i|$. The chance the pair is split by the
partition on coordinate i when the diameter is D is at most $\frac{2 h k^{1/p} \ell_i}{D}$. Summing over i, since (as
shown below) $\sum_i \ell_i \le k^{1-1/p}\, d(u, v)$, we get at most a $2hk\, d(u, v)/D$ chance that $\hat d(u, v) = D$.
Thus the expected value of $\hat d(u, v)$ is at most $O(hk \log_h(n k^{1/p}))\, d(u, v)$.

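To show $\sum_i \ell_i \le k^{1-1/p} \big(\sum_i \ell_i^p\big)^{1/p}$, we show $\big(\sum_i \ell_i\big)^p \le k^{p-1} \sum_i \ell_i^p$ by induction on p. It
trivially holds for p = 1. Given the fact for p - 1, we have by induction

\[ \Big(\sum_i \ell_i\Big)^p \le k^{p-2} \Big(\sum_i \ell_i^{p-1}\Big)\Big(\sum_j \ell_j\Big)
   = k^{p-2} \sum_i \sum_j \ell_i^{p-1} \ell_j
   \le k^{p-1} \sum_i \ell_i^p . \]

The last step follows since $\ell_i^{p-1}\ell_j + \ell_j^{p-1}\ell_i \le \ell_i^p + \ell_j^p$ (this is equivalent to $(\ell_i - \ell_j)(\ell_i^{p-1} - \ell_j^{p-1}) \ge 0$).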
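The inequality between the coordinate sum and the $\ell_p$ distance, namely that the sum of the coordinate differences is at most $k^{1-1/p}$ times the p-norm, is easy to spot-check numerically. The sketch below is only a sanity check on a small grid, not a proof; the function names and the tolerance are assumptions.

```python
import itertools


def norm_bound_sides(lengths, p):
    """Return (sum of lengths, k^(1 - 1/p) * (sum of p-th powers)^(1/p))."""
    k = len(lengths)
    lhs = sum(lengths)
    rhs = k ** (1 - 1 / p) * sum(x ** p for x in lengths) ** (1 / p)
    return lhs, rhs


def holds_on_grid(values, k, p, tol=1e-9):
    """Check lhs <= rhs for every k-tuple drawn from the given values."""
    for combo in itertools.product(values, repeat=k):
        lhs, rhs = norm_bound_sides(combo, p)
        if lhs > rhs + tol:
            return False
    return True
```

The bound is tight when all coordinates are equal: for lengths (2, 2, 2) and p = 3 both sides are 6 up to floating-point rounding.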


2.3  Recursive  MTS  construction 

Bartal’s  probabilistic  approximation  of  general  metrics  by  HSTs  suggests  a  definite  program  for  achieving 
improved  probabilistic  MTS  algorithms:  We  find  an  algorithm  for  HSTs  and  apply  Theorem  2.1.  Because 
of  their  structure,  a  very  natural  approach  for  tackling  HSTs  is  to  inductively  apply  an  algorithm.  The 
polylog(n) result described in this dissertation follows exactly this program.

A  major  hurdle  is  to  conceive  of  a  good  scenario  to  abstract  the  details  of  algorithms  for  subtrees  of  an 
HST,  so  that  we  can  define  simple  techniques  to  combine  these  into  an  algorithm  for  the  entire  tree  using 
recursion.  The  remainder  of  this  chapter  describes  this  abstraction  and  demonstrates  how  to  apply  it. 

To inductively construct our algorithm for the entire HST, we imagine that we already have r-competitive
subalgorithms for each subtree of the root, and we construct an algorithm to combine these into an algorithm
for the entire tree. We can abstract the r-competitiveness of the subalgorithms by imagining that each time
the task vector says we pay δ, in fact our on-line algorithm pays rδ. We will compare it to a player who does
not incur this factor of r. We call this r the cost ratio; typically r = polylog(n).

A  complication  that  arises  is  that  different  subtrees  can  have  different  cost  ratios.  For  the  moment, 
though,  we  concentrate  on  the  much  simpler  problem  of  finding  an  algorithm  when  the  cost  ratios  are  equal. 

In using cost ratios, we speak of unfair competitiveness, a notion introduced by Blum et al. and
formalized by Seiden [BKRS92, Sei99]. We say algorithm A has r-unfair competitive ratio p with additive
b if for all event sequences T, algorithm A outputs an action sequence $v_A$ so that for all action sequences v,

\[ \mathbf{E}\big[\mathit{move}(v_A) + r\, \mathit{local}(T, v_A)\big] \le p\,\big(\mathit{move}(v) + \mathit{local}(T, v)\big) + b . \tag{2.1} \]

The only difference between this definition and the definition of the competitive ratio is the appearance of r
on the left-hand side.
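The quantity on the left-hand side of (2.1) for one fixed action sequence can be computed with a few lines of Python. This is a sketch with assumed conventions: `dist` is a distance matrix, `tasks[t][s]` is the cost of processing task t in state s, `actions[t]` is the state used for task t, and the function name `unfair_cost` is hypothetical. Setting r = 1 gives the ordinary (fair) cost used on the right-hand side.

```python
def unfair_cost(dist, tasks, actions, start, r=1):
    """Movement cost plus r times local (task-processing) cost."""
    move = 0
    local = 0
    previous = start
    for task, state in zip(tasks, actions):
        move += dist[previous][state]   # pay the metric to change state
        local += task[state]            # pay the task's cost in that state
        previous = state
    return move + r * local


# Hypothetical two-state metric with unit distance.
TWO_STATES = [[0, 1], [1, 0]]
```

With tasks (3, 0) then (0, 2), starting in state 0 and processing both tasks in state 1, the movement cost is 1 and the local cost is 2, so the fair cost is 3 and the 4-unfair cost is 9.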

The first approach to consider, as Bartal did, is to analyze Marking in this unfair setting [Bar96].

Theorem 2.5 Marking has r-unfair competitive ratio $(r + 1) H_n$ for a uniform metric space of n
nodes.

Proof. We analyze by phases. Any action sequence must pay at least 1 in each phase; we argue that
Marking’s expected unfair cost is at most $(r + 1) H_n$. Consider the first state to become marked.
The probability that Marking ever goes to this state is $\frac{1}{n}$, and if so then Marking pays at most r + 1
for this state (at most r in local costs, and 1 in movement cost after it becomes marked). Thus the
expected cost to Marking at this state is at most $\frac{r+1}{n}$. Now consider the second state to become
marked. The probability that Marking ever goes to this state is $\frac{1}{n-1}$, and if so then Marking pays at
most r + 1 for this; thus the expected cost to Marking at this state is $\frac{r+1}{n-1}$. Generally, at the ith state
to become marked in the phase, Marking expects to pay at most $\frac{r+1}{n-i+1}$ at that state. We sum over
all states to get $(r + 1) H_n$. ■
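The final summation in this proof can be written out explicitly. The per-state bound of $(r+1)/(n-i+1)$ for the ith state to become marked is assumed here, as the value consistent with the stated total; substituting $j = n - i + 1$ reverses the order of the terms.

```latex
\[
  \sum_{i=1}^{n} \frac{r+1}{n-i+1}
  \;=\; (r+1) \sum_{j=1}^{n} \frac{1}{j}
  \;=\; (r+1)\, H_n .
\]
```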

It is not too difficult to imagine what happens when we apply Marking recursively to a tree. Because
of the $rH_n$ term in the competitive ratio, what effectively happens is that the $H_n$ terms multiply so that for
an L-level h-HST, the competitive ratio is roughly $O(H_n^L)$. The 1-level subtrees have ratio $O(H_n)$, but to
construct the algorithm for the 2-level subtrees, we must take $r = O(H_n)$ to account for the performance
of the 1-level subtrees below, giving a ratio of $O(H_n^2)$ overall. Likewise, the 3-level subtrees have a ratio
of $O(H_n^3)$, and so on. We have neglected some details (notably, we have ignored details about exactly how
we combine the subalgorithms, and we have ignored the additive b), but this is roughly what happens in
recursively applying Marking to an HST.

Figure 2.4: A very simple 2-HST and task sequence.

By choosing h to balance the metric-space approximation ratio h against the number of levels $O(\log_h \Delta)$,
Bartal proves the following theorem.

Theorem 2.6 ([Bar96]) Given a metric space with Δ as the ratio of longest to shortest distance, we
choose $h = 2^{\sqrt{\lg \Delta \lg H_n}}$. By recursively applying Marking to an h-HST probabilistically
approximating the original metric space, we get a competitive ratio of $2^{O(\sqrt{\log \Delta \log\log n})}$.

In many cases (such as a shortest-path metric in an unweighted graph) this bound is an improvement on the
earlier linear bounds [BLS92, IS98], but it is still much worse than the conjectured $O(\log n)$ possibility.
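The choice of h in Theorem 2.6 can be motivated by a rough balancing calculation. The following is only a sketch under simplifying assumptions that are not from the text: constants and the polylogarithmic factors of Theorem 2.2 are dropped, the cost of the approximation is taken to be exactly h, and the recursive ratio is taken to be exactly $H_n^L$ with $L = \log_h \Delta$ levels.

```latex
\[
  \lg\bigl(h \cdot H_n^{L}\bigr)
  \;=\; \lg h + \frac{\lg \Delta}{\lg h}\,\lg H_n ,
  \qquad L = \log_h \Delta = \frac{\lg \Delta}{\lg h} .
\]
Writing $x = \lg h$ and $c = \lg \Delta \,\lg H_n$, the exponent $x + c/x$ is
minimized at $x = \sqrt{c}$, which gives
\[
  \lg h = \sqrt{\lg \Delta \,\lg H_n}
  \quad\Longrightarrow\quad
  h \cdot H_n^{L} \;=\; 2^{\,2\sqrt{\lg \Delta \,\lg H_n}}
  \;=\; 2^{\,O\left(\sqrt{\log \Delta \,\log\log n}\right)} ,
\]
using $H_n = \Theta(\log n)$ in the last step.
```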


2.4  Bounding  a  competitive  ratio 

The key problem with the Marking approach is that Marking’s unfair competitive ratio multiplies the ratio r
by $2H_n = O(\log n)$. A ratio of $r + O(\log n)$ would be much more useful, as we could potentially add merely
$O(\log n)$ for each level of the HST. In this section, we see how we can rigorously use such an algorithm A
with an r-unfair competitive ratio of $r + \alpha(n)$ to recursively construct an algorithm for an L-level h-HST
with a (fair) competitive ratio of $O(L\alpha(n))$, for h sufficiently large.

The techniques used here are later reused with less description in the polylog(n) result. For that result,
we must work around the fact that an h-HST could have many levels. For example, the space defined by
placing points at $1, 2, 4, \ldots, 2^{n-1}$ on the number line will give an HST of $\Theta(\log_h 2^{n-1})$ levels. It turns out,
though, that by being more careful with how we combine subspaces if one is much larger than others, we
can get the polylog(n) result. We will see this approach in Theorem 4.8.

To  run  A  recursively  on  an  HST  T  with  each  point  of  the  space  representing  a  subtree  of  T,  we  must 
decide  when  a  point  representing  a  subtree  incurs  a  task-processing  cost.  We  accomplish  this  by  maintaining 
the  work  function  OPT  for  the  points  in  that  subtree  alone.  (That  is,  points  in  other  subtrees  cannot  pin  any 
points  in  the  subtree.)  The  point  representing  a  subtree  incurs  a  loss  each  time  the  minimum  work  function 
within  that  subtree  increases.  The  amount  of  the  loss  is  scaled  down  by  the  diameter  of  T  (technically,  a 
little  less)  and  fed  into  A. 

As  A  progresses  at  the  root  level  of  the  tree,  it  will  occasionally  move  from  one  subtree  to  another.  When 
this occurs, the overall algorithm continues running A at that level, but for the lower levels of the HST (which
have  now  changed  subtrees)  A  begins  anew.  Restarting  the  algorithm  in  this  way  does  not  affect  the  work- 
function  computation  for  the  level  where  the  movement  occurs,  but  the  work-function  computation  at  the 
lower  levels  does  begin  from  scratch. 

Example 2.2 To get a handle on the subtleties of this scheme, we consider an example. We work
with running Marking recursively on the HST and task sequence of Figure 2.4. (The choice of
Marking is inappropriate: It does not have the required $r + \alpha(n)$ competitive ratio. But Marking
suffices for this illustration.)

Initially algorithm A1 chooses between the left subtree and right subtree with equal probability;
say it chooses the left subtree. Then algorithm A2 runs and chooses between its left subtree and right
subtree equally; say it chooses the left, so that the algorithm for the HST is initially at node q0.

On receiving $T^1$ = (i, 0, 0), the status of A1 does not change; although the work function for q0
increases, the minimum work function within the subtree rooted at A2 is still 0. However, the
work function for the left subtree of A2 has increased by |. Thus Marking at A2 increases the counter
for the left subtree by | (we divide the increase by the diameter of the space A2). Algorithm A2
does not move from q0, and so we remain at q0 to process the first vector.

On receiving $T^2$ = ( | , \ , 0), the work function for both subtrees of A2 increases by \ ; thus now
the counter for the left subtree of A2 is at 1. Now its left subtree (q0) is marked, so A2 will move to the right
subtree (q1). At the root level, the left subtree's minimum work function is now I, and so A1’s left
counter increases from 0 to | (remember that we scale by the space’s diameter); A1 does not move.

So the algorithm processes the second vector at q1.

For task $T^3$ = (3, 3, 2), the work function for A1’s left subtree increases by 3, so that A1’s left
subtree counter increases from | to g. Meanwhile, A1’s right subtree’s work function increases by
2, so A1’s right subtree counter also increases. Thus A1’s left subtree becomes marked, and
it moves to the right subtree. The algorithm processes $T^3$ at node q2.

Finally, consider the task $T^4$ = (0, 0, 3). This increases the work function for A1’s right subtree
by 3, so that A1’s right subtree counter becomes §. Now the right subtree of A1 is marked, and so
Marking resets the counters and begins at a random space. Say it randomly chooses the left subtree.
Then A2 begins anew with work function and counters at 0; say it chooses the left also. Then the
algorithm processes $T^4$ at node q0.

In this example, we treated the tree as an entire entity. We now look at what A1 saw. It saw the
following task sequence.

Ti=<0,0> 

Tj  =  (i,0> 

Tj  =  <!,!> 

Tj  =  (0,l> 

As a Marking algorithm using r = 1, A1 is in either tree with equal probability for the first and
second of these tasks. The left subtree becomes marked with the third, and so A1 processes it in the right subtree. With
the fourth, the right subtree becomes marked also, and so A1 clears its marks and chooses a random subtree
for it.

To  bound  the  performance  of  our  recursive  application,  we  must  have  a  bound  on  the  magnitude  of  the 
additive  part  (the  b  of  our  definition  of  competitive  ratio  in  (2.1)).  We  need  h  to  be  about  as  large  as  b,  so 
that  when  the  subtree  algorithm  restarts,  the  additive  part  (which  we  may  pay)  will  be  only  a  constant  factor 
more  than  it  cost  us  to  move  into  the  subtree.  We  will  see  this  in  the  mathematics  of  the  formal  proof. 

Theorem 2.7 Say algorithm A has r-unfair competitive ratio $r + \alpha(n)$ with additive $\beta(n) \ge 2$ on
the uniform metric. The competitive ratio of running A recursively on an L-level $(2.5\beta(n))$-HST
with diameter D is at most $1 + 4\alpha(n)L$ with additive $5\beta(n)D$.

Remark. In running A, we take r to be $\frac{1}{4}$ times the maximum ratio of the subtrees’ algorithms;
$\beta(n)$ is computed using this value.
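To see why an additive dependence on r matters, the two recursions can be compared numerically. The sketch below rests on assumptions made only for illustration: the multiplicative recurrence maps a ratio r to $(r+1)H_n$ at each level, starting from r = 1, and the additive bound is the $1 + 4\alpha(n)L$ of Theorem 2.7 with $\alpha(n)$ set to $H_n$ so that the two are comparable. The function names are hypothetical.

```python
from fractions import Fraction


def harmonic(n):
    """The nth harmonic number as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


def marking_recursive_ratio(n, levels):
    """Assumed recurrence: each level maps r to (r + 1) * H_n, from r = 1."""
    ratio = Fraction(1)
    for _ in range(levels):
        ratio = (ratio + 1) * harmonic(n)
    return ratio


def additive_recursive_ratio(alpha, levels):
    """Bound of Theorem 2.7: 1 + 4 * alpha * L."""
    return 1 + 4 * alpha * levels
```

For n = 16 the multiplicative recurrence grows geometrically in the number of levels, while the additive bound grows by the same amount, $4\alpha(n)$, at every level.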

Proof. We prove this by induction on L. The trivial single-point space handles the base case L = 0.

Say we have an L-level HST of diameter D, and let p be the maximum competitive ratio of
the subtrees’ algorithms, at most $1 + 4\alpha(n)(L - 1)$. The additive part is $5\beta(n)$ times the subtree’s
diameter of at most $\frac{2D}{5\beta(n)}$, for a product of 2D.

To bound the overall performance, we will want to use our inductive hypothesis and the r-unfair
competitive ratio of A. To discuss A’s performance, we define $T_A$ as the task sequence that A sees.
That is, $(T_A)^t_i$ is $\frac{4}{3D}$ times the change in minimum work function in subtree i as a result of the actual
task vector $T^t$, where we compute the minimum work function in subtree i considering only those
states in the subtree (i.e., in this computation, states in other subtrees cannot pin states in subtree i).
We divide the change in work function by $\frac{3}{4}D$ rather than simply D because of the effect, which will
soon appear, of the additive part of the subalgorithms’ ratios.

To bound the competitive ratio for our complete algorithm (which combines A with the subtrees’
algorithms), consider an arbitrary action sequence v on the entire space of n points. This implies
an action sequence $v_b$ specifying in which subtree (not state) to process each task. To use A’s
competitive ratio, we want to bound from below the total off-line cost to v in terms of $\mathit{local}(T_A, v_b)$
and $\mathit{move}_U(v_b)$, since their sum is what A can compete against. (We use $\mathit{move}_U$ to represent the
movement cost on the diameter-1 uniform space that A uses.) The first apparent (but flawed) answer
is $\mathit{local}(\frac{3}{4}DT_A, v_b) + D\, \mathit{move}_U(v_b)$. To understand this, consider a segment of time where v stays
within the same subtree. The algorithm must move into the subtree, at a cost of D. And because the
work function within the subtree increases according to $\frac{3}{4}DT_A$ within the segment, the off-line cost
increases with $\frac{3}{4}DT_A$. Summing over all segments, we get $\mathit{local}(\frac{3}{4}DT_A, v_b) + D\, \mathit{move}_U(v_b)$.

But $\mathit{local}(\frac{3}{4}DT_A, v_b)$ is not accurate: The minimum cost for processing a segment of v
remaining in the same subtree should be computed using work-function values starting at 0, but the
work-function values used to compute $T_A$ are not all equal (except for the first segment). In fact, for each
of these $\mathit{move}_U(v_b)$ segments, the actual optimal cost within the segment and the cost represented
by $T_A$ may differ by as much as the diameter of the subtree, which is at most $\frac{2D}{5\beta(n)}$. So the first
apparent answer $\mathit{local}(\frac{3}{4}DT_A, v_b) + D\, \mathit{move}_U(v_b)$ may be wrong by as much as $\frac{2D}{5\beta(n)}\, \mathit{move}_U(v_b)$.

Thus the total cost for v is at least

\[ \mathit{local}(\tfrac{3}{4}DT_A, v_b) + \Big(D - \frac{2D}{5\beta(n)}\Big)\, \mathit{move}_U(v_b)
   \ge \mathit{local}(\tfrac{3}{4}DT_A, v_b) + \tfrac{3}{4}D\, \mathit{move}_U(v_b) . \]

Now we look at what algorithm A does. Let $v_A$ represent the sequence of moves that A makes at
the top level of the HST. Within a single segment of $v_A$ staying within a single subtree, the expected
cost (according to the inductive hypothesis) is at most p times the optimal cost for servicing this
segment, plus 2D. Again, it is tempting to use $T_A$ to bound the optimal cost for servicing the
segment, but work-function discrepancies mean this estimate may be off: The proper way to compute
the optimal cost is with the work function zero at all states at the beginning, while when the algorithm
moves into the subtree, the work function varies between states. In this case, however, the perceived
cost (that is, what $T_A$ indicates) is at most the actual cost, since the computation using $T_A$ only
happens to believe that some of the states have incurred more cost than the minimum among the
states, whereas in fact they have not. Thus within each of the $\mathit{move}_U(v_A) + 1$ segments of $v_A$, our
expected cost is at most p times the local cost (according to the task sequence $T_A$ that A sees) plus
2D. Adding another D for each time we move between segments, our total cost is at most

\[ p\, \mathit{local}(\tfrac{3}{4}DT_A, v_A) + 3D\, \mathit{move}_U(v_A) + 2D . \]


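Of course $v_A$ is actually a random variable based on A’s random choices. Since A has r-unfair
competitive ratio $r + \alpha(n)$ with $r = \frac{p}{4}$, we know that for an arbitrary action sequence $v_b$, A’s expected cost is at
most

\begin{align*}
\mathbf{E}\big[p\, \mathit{local}(\tfrac{3}{4}DT_A, v_A) + 3D\, \mathit{move}_U(v_A) + 2D\big]
  &= 3D\, \mathbf{E}\big[\tfrac{p}{4}\, \mathit{local}(T_A, v_A) + \mathit{move}_U(v_A)\big] + 2D \\
  &\le 3D \Big( \big(\tfrac{p}{4} + \alpha(n)\big)\big(\mathit{local}(T_A, v_b) + \mathit{move}_U(v_b)\big) + \beta(n) \Big) + 2D \\
  &= (p + 4\alpha(n))\big(\mathit{local}(\tfrac{3}{4}DT_A, v_b) + \tfrac{3}{4}D\, \mathit{move}_U(v_b)\big) + 3\beta(n)D + 2D \\
  &\le (1 + 4\alpha(n)L)\big(\mathit{local}(\tfrac{3}{4}DT_A, v_b) + \tfrac{3}{4}D\, \mathit{move}_U(v_b)\big) + 5\beta(n)D \\
  &\le (1 + 4\alpha(n)L)\big(\mathit{local}(T, v) + \mathit{move}(v)\big) + 5\beta(n)D
\end{align*}

Thus we conclude that our overall competitive ratio for the HST is $1 + 4\alpha(n)L$, plus an additive
$5\beta(n)D$. ■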

Our goal now is to demonstrate an algorithm with an r-unfair competitive ratio of $r + O(\log n)$. One
way  to  this  goal  is  to  detour  into  machine  learning  theory.  We  pursue  this  now. 

Chapter 3

The  expert  prediction  problem 


As  the  MTS  problem  is  foundational  to  competitive  analysis,  so  the  problem  of  prediction  from  expert 
advice  is  foundational  to  on-line  machine  learning  theory.  It  has  several  specific  formulations.  In  this 
chapter  we  first  look  at  one  of  the  more  traditional  formulations,  Experts-Predict,  and  then  we  examine 
more  closely  a  “decision-theoretic”  formulation.  From  there  we  can  derive  new  analyses  of  algorithms  in 
the  decision-theoretic  formulation  that  do  well  with  a  particular  goal  called  partitioning  bounds ,  and  we  can 
attempt  to  translate  these  bounds  to  the  r -unfair  MTS  problem. 


3.1  Classical  formulation 

Littlestone  and  Warmuth  proposed  the  initial  Experts-Predict  problem. 

Problem  Experts-Predict  ([LW94])  We  see  a  set  of  n  experts.  For  each  time  step,  each  expert 
makes  a  Boolean  prediction.  We  decide  on  a  Boolean  prediction,  and  then  we  learn  the  correct 
answer.  Our  goal  is  to  minimize  the  number  of  mistakes  we  make  relative  to  the  most  accurate 
expert.* 

For  example,  we  might  think  of  the  experts  as  meteorologists  predicting  whether  it  will  rain  tomorrow. 
We  want  to  predict  well  relative  to  the  most  talented  among  them  without  too  many  mistakes  along  the  way. 

From  a  learning  perspective,  this  question  models  a  situation  where  we  have  a  set  of  hypotheses  (termed 
experts ),  one  of  which  predicts  fairly  accurately  how  the  world  operates.  The  question  is  how  quickly  we 
can  converge  on  a  good  predictor.  Thus,  our  goal  is  to  bound  how  much  worse  we  do  relative  to  the  best 
single  expert. 

The  mistake  bound  of  an  algorithm  bounds  the  number  of  mistakes  the  algorithm  makes  [Lit88].  In 
contrast  to  much  of  machine  learning,  mistake  bounds  do  not  employ  distributional  assumptions.  That  is, 
the  experts  need  not  perform  uniformly  over  time  in  any  sense.  Despite  the  absence  of  such  assumptions, 
the  theoretical  bounds  obtained  are  surprisingly  good. 

*The algorithms actually extend to bounded real-valued predictions, with a loss function (such as square loss or log loss) assigning the penalties. With the square loss function, for example, if an expert predicts $x$ and the true answer is $y$, the loss is $(x - y)^2$.

If one of the experts predicts perfectly, then the Halving algorithm performs optimally among deterministic algorithms.

Algorithm  Halving  ([Mit82])  We  keep  track  of  a  set  P  of  experts,  initially  including  all  of  them. 
Each  time  step,  we  predict  whatever  the  majority  of  experts  in  P  predict.  Once  we  receive  the  true 
answer,  we  remove  from  P  all  experts  who  predicted  wrong. 

Each time Halving predicts wrong, at least half of the experts in $P$ predicted wrong, so the size of $P$ goes down by at least half. Thus the mistake bound of Halving is $\lfloor \lg n \rfloor$.
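
The Halving rule can be sketched in a few lines of Python. This is an illustrative sketch, not code from the original text: the function names, the representation of a round as a `(predictions, answer)` pair, and the tie-breaking rule (ties go to `True`) are all assumptions made here.

```python
def halving_predict(alive, predictions):
    """Predict with the majority of surviving experts (ties go to True, an arbitrary choice)."""
    votes_true = sum(1 for i in alive if predictions[i])
    return 2 * votes_true >= len(alive)


def halving_update(alive, predictions, answer):
    """Keep only the experts whose prediction matched the true answer."""
    return {i for i in alive if predictions[i] == answer}


def run_halving(rounds, n):
    """Run Halving over a list of (predictions, answer) rounds.

    Returns the number of mistakes and the surviving expert set P.
    """
    alive = set(range(n))
    mistakes = 0
    for predictions, answer in rounds:
        if halving_predict(alive, predictions) != answer:
            mistakes += 1
        alive = halving_update(alive, predictions, answer)
    return mistakes, alive
```

With four experts of which one is perfect, the sketch makes at most $\lfloor \lg 4 \rfloor = 2$ mistakes on any sequence, which agrees with the bound stated above.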

When none of the experts predicts perfectly, the problem becomes harder. One simple approach (as we saw with Marking) is to proceed in phases: In each phase, we run Halving until $P$ becomes empty. If the best expert makes $m$ mistakes, then this phased version of Halving makes at most $m \lfloor \lg n \rfloor$ mistakes. Littlestone and Warmuth's weighted-majority algorithm WM does significantly better.

Algorithm WM ([LW94]) We use a parameter $\beta \in (0, 1)$ and maintain a weight $w_i$ with each expert, initially $w_i^0 = 1$. At time step $t$, we predict according to a weighted majority of the experts, where each expert gets a weight of $w_i^{t-1}$. Once we learn the correct answer, we update the weight of each expert who was mistaken to become $w_i^t \leftarrow w_i^{t-1}\beta$.

Example 3.1 Take $\beta = \frac{1}{3}$. Say we have four experts, $x_0$, $x_1$, $x_2$, and $x_3$. Our weights are initially $w^0 = (1, 1, 1, 1)$.

Say that $x_0$ predicts false on the first time step while the others predict true. Then we predict true, since it has weight 3 while false has weight 1. We then learn the true answer, false in this example. We update the weights to become $w^1 = (1, \frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.

Say that $x_0$ and $x_1$ predict true on the second time step, and $x_2$ and $x_3$ predict false. Then true has weight $\frac{4}{3}$ while false has $\frac{2}{3}$; our algorithm predicts true. If this is correct, then the weights are updated to become $w^2 = (1, \frac{1}{3}, \frac{1}{9}, \frac{1}{9})$.

On the third time step, if $x_0$ predicts false and the others predict true, then we predict false, since false has weight 1 and true has weight $\frac{5}{9}$.
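
The three steps of this example can be replayed mechanically. The following is a minimal sketch, assuming $\beta = \frac{1}{3}$ (the value for which the weights 3 versus 1 on the first step and 1 versus a smaller weight on the third step work out); the function names and the tie-breaking rule are hypothetical choices made for illustration. Exact fractions are used so the weights can be compared directly.

```python
from fractions import Fraction


def wm_predict(weights, predictions):
    """Weighted-majority vote; ties go to True (an arbitrary choice)."""
    true_w = sum(w for w, p in zip(weights, predictions) if p)
    false_w = sum(w for w, p in zip(weights, predictions) if not p)
    return true_w >= false_w


def wm_update(weights, predictions, answer, beta):
    """Multiply the weight of every mistaken expert by beta."""
    return [w if p == answer else w * beta
            for w, p in zip(weights, predictions)]


def replay_example():
    """Replay the three time steps of the example with beta = 1/3."""
    beta = Fraction(1, 3)
    w = [Fraction(1)] * 4
    steps = [([False, True, True, True], False),   # step 1, answer is false
             ([True, True, False, False], True)]   # step 2, answer is true
    guesses = []
    for predictions, answer in steps:
        guesses.append(wm_predict(w, predictions))
        w = wm_update(w, predictions, answer, beta)
    guesses.append(wm_predict(w, [False, True, True, True]))  # step 3
    return guesses, w
```

Calling `replay_example()` yields the predictions true, true, false and the weights $(1, \frac{1}{3}, \frac{1}{9}, \frac{1}{9})$ after the second step.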

The  beauty  of  WM  lies  in  the  fact  that,  despite  its  simplicity,  its  bound  is  quite  strong.  The  proof  is  cute; 
we  repeat  its  technique  several  times  in  this  chapter. 

Theorem 3.1 ([LW94]) For any expert $k$, WM has mistake bound

\[ m_{\mathrm{WM}} \le \frac{m_k \ln(1/\beta) + \ln n}{\ln\frac{2}{1+\beta}} , \]

where $m_k$ is the number of mistakes made by expert $k$.

Remark. To make better sense of this bound, let $\beta = 1 - 2\varepsilon$ for small $\varepsilon$. Then the bound translates to approximately $2(1 + \varepsilon)m_k + \frac{1}{\varepsilon}\ln n$. Intuitively, this is an explicit trade-off between how quickly we settle on a particular expert (the $\frac{1}{\varepsilon}\ln n$ term) and how quickly we are able to adapt if that expert is actually bad but happens to do well for the first several rounds (the $2(1 + \varepsilon)m_k$ term).
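
The approximation in this remark can be checked with a short series expansion. The following worked step is a sketch added for illustration, assuming $\beta = 1 - 2\varepsilon$ so that $\frac{2}{1+\beta} = \frac{1}{1-\varepsilon}$, and keeping only the leading terms in $\varepsilon$.

```latex
\begin{align*}
\ln\frac{1}{\beta} &= -\ln(1 - 2\varepsilon)
  = 2\varepsilon + 2\varepsilon^2 + O(\varepsilon^3), \\
\ln\frac{2}{1+\beta} &= -\ln(1 - \varepsilon)
  = \varepsilon + \tfrac{1}{2}\varepsilon^2 + O(\varepsilon^3) \;\ge\; \varepsilon, \\
\frac{\ln(1/\beta)}{\ln\frac{2}{1+\beta}}
  &= \frac{2(1 + \varepsilon) + O(\varepsilon^2)}{1 + \tfrac{1}{2}\varepsilon + O(\varepsilon^2)}
  = 2 + \varepsilon + O(\varepsilon^2), \\
\frac{\ln n}{\ln\frac{2}{1+\beta}} &\le \frac{1}{\varepsilon}\ln n .
\end{align*}
```

For small $\varepsilon$ the coefficient of $m_k$ is therefore at most $2(1+\varepsilon)$, and the additive term is at most $\frac{1}{\varepsilon}\ln n$.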

Proof. [LW94] Define $W^t = \sum_i w_i^t$ to be the total weight at time $t$ and say $m_{\mathrm{WM}}$ is the number of mistakes WM makes. If WM makes a mistake at time $t$, then, since at least $W^{t-1}/2$ weight is on the experts that err, the total weight decreases by at least $(1 - \beta)(W^{t-1}/2)$. Thus when WM makes a mistake, $W^t$ is at most $W^{t-1} - (1 - \beta)(W^{t-1}/2) = \frac{1+\beta}{2}W^{t-1}$. Since WM makes $m_{\mathrm{WM}}$ mistakes, and since $W^0 = n$, the final total weight is at most $\left(\frac{1+\beta}{2}\right)^{m_{\mathrm{WM}}} n$.

On the other hand, $W^{\text{final}}$ is at least the final weight $w_k^{\text{final}}$ of expert $k$, which is exactly $\beta^{m_k}$. Thus we have

\[ \beta^{m_k} \le \left(\frac{1+\beta}{2}\right)^{m_{\mathrm{WM}}} n . \]

From here we take logarithms and solve for $m_{\mathrm{WM}}$ to get the result.


This  bound  is  very  close  to  twice  the  best  expert’s  loss.  Moreover,  it  says  that  we  can  double  the  number 
of  experts  (refining  the  hypothesis  space  by  a  factor  of  two)  with  very  little  increase  in  worst-case  loss.  A 
major  strength  of  the  theory  of  expert  advice  is  how  tight  a  bound  we  get  with  the  very  simple  algorithm 
WM. 

Both Halving and WM are deterministic. A randomized version of WM, which chooses experts randomly based on the weight distribution, roughly halves the bound on the expected loss to $(1 + \varepsilon)m_k + \frac{1}{2\varepsilon}\ln n$ [LW94]. We see a proof of this in the Experts problem, an alternative formulation of Experts-Predict.


3.2  Decision-theoretic  formulation 

Freund  and  Schapire  abstract  away  the  aspect  of  combining  expert  predictions  to  arrive  at  what  they  term 
a  “decision-theoretic”  formulation  of  Experts-Predict  [FS97].  We  use  this  formulation  throughout  the 
remainder  of  this  dissertation,  so  we  refer  to  this  problem  simply  as  Experts. 

Problem Experts ([FS97]) We see a set of $n$ experts. For each time step $t$, we choose an expert $v^t$. Then we learn the loss vector $\ell^t$, which specifies the loss $\ell_i^t \in [0, 1]$ of each expert for that time step. We incur the loss of the chosen expert, $\ell_{v^t}^t$. Our goal is to minimize the total loss we incur.

Any  deterministic  algorithm  for  Experts  does  at  least  n  times  worse  than  the  best  expert  in  the  worst 
case.  An  adversary  can  construct  a  worst-case  sequence  by  simulating  the  algorithm  and  each  time  step 
giving  a  loss  of  1  to  the  expert  that  the  algorithm  will  choose  and  a  loss  of  0  to  the  other  experts.  Thus  after 
T  time  steps,  the  algorithm’s  cost  is  T,  while  the  best  expert’s  loss  is  at  most  T/n.  Since  O(n)  bounds  are 
undesirable,  we  restrict  our  attention  to  randomized  algorithms. 
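
The adversary argument can be made concrete with a small simulation. This is an illustrative sketch only: the interface (a deterministic algorithm is a function `choose(history)` returning an expert index) and the follow-the-leader example algorithm are assumptions introduced here, not part of the original text.

```python
def adversary_losses(choose, n, steps):
    """Play the adversary against a deterministic algorithm.

    `choose(history)` maps the list of past loss vectors to an expert index.
    Because the algorithm is deterministic, the adversary can call it first
    and then give loss 1 to the chosen expert and loss 0 to everyone else.
    Returns (algorithm's total loss, list of per-expert total losses).
    """
    history = []
    totals = [0] * n
    alg_loss = 0
    for _ in range(steps):
        pick = choose(history)
        loss = [1 if i == pick else 0 for i in range(n)]
        alg_loss += loss[pick]
        for i in range(n):
            totals[i] += loss[i]
        history.append(loss)
    return alg_loss, totals


def follow_the_leader(n):
    """A sample deterministic algorithm: pick the expert with least loss so far."""
    def choose(history):
        totals = [sum(vec[i] for vec in history) for i in range(n)]
        return totals.index(min(totals))
    return choose
```

After $T$ steps the algorithm has lost exactly $T$, while the $T$ units of loss are spread over $n$ experts, so the best expert has lost at most $T/n$.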

Given that one of the experts is perfect (that is, if for some $i$, at all times $t$ we have $\ell_i^t = 0$), we can use the following algorithm Rand-Halving, a randomized version of Halving and a degenerate instance of Hedge (discussed later). It has an expected loss of at most $H_n$.

Algorithm Rand-Halving Let $P$ be a set of experts, initially including all experts. Each time step, we pick an expert uniformly at random from $P$. Once we receive the loss vector, we remove from $P$ all experts who incur some nonzero loss.

When all experts incur some loss, the problem becomes more complicated. The Hedge algorithm is Freund and Schapire's Experts adaptation of WM [FS97]. (In fact, the coefficients of Theorem 3.2's guarantee are optimal for on-line algorithms [Vov95, FS97].)

Algorithm Hedge ([FS97]) We use a parameter $\beta \in (0, 1)$ and maintain a weight $w_i$ with each expert, initially $w_i^0 = 1$. At time step $t$, we choose expert $i$ with probability proportional to its weight, $w_i^{t-1} / \sum_j w_j^{t-1}$. Given the loss vector, we update the weight of each expert to become

\[ w_i^t \leftarrow w_i^{t-1}\beta^{\ell_i^t} . \]

Theorem 3.2 ([FS97]) If an expert $k$ incurs total loss $loss_k$, then Hedge incurs expected loss at most

\[ E[loss_{\mathrm{Hedge}}] \le \frac{\ln(1/\beta)}{1 - \beta}\, loss_k + \frac{1}{1 - \beta}\ln n . \]

Remark. Again, to get a feel for the tradeoff, we make better sense of this bound by letting $\beta = 1 - 2\varepsilon$ for small $\varepsilon$. Then the bound translates to approximately $(1 + \varepsilon)\,loss_k + \frac{1}{2\varepsilon}\ln n$, roughly a factor of 2 less than WM's bound.
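
A minimal sketch of Hedge, together with the bound of Theorem 3.2, is given below for illustration. The function names are hypothetical, and the sketch tracks the expected loss directly from the probability distribution instead of sampling an expert, which is an implementation choice made here so the result is deterministic.

```python
import math


def hedge_distribution(weights):
    """Probability of choosing each expert: proportional to its weight."""
    total = sum(weights)
    return [w / total for w in weights]


def hedge_update(weights, losses, beta):
    """Multiplicative update: w_i <- w_i * beta ** loss_i."""
    return [w * beta ** l for w, l in zip(weights, losses)]


def run_hedge(loss_vectors, n, beta):
    """Return Hedge's expected total loss and its final weights."""
    weights = [1.0] * n
    expected = 0.0
    for losses in loss_vectors:
        probs = hedge_distribution(weights)
        expected += sum(p * l for p, l in zip(probs, losses))
        weights = hedge_update(weights, losses, beta)
    return expected, weights


def hedge_bound(loss_k, n, beta):
    """Right-hand side of the Hedge guarantee for an expert with total loss loss_k."""
    return (math.log(1 / beta) * loss_k + math.log(n)) / (1 - beta)
```

On any sequence of loss vectors in $[0, 1]$, the value returned by `run_hedge` should not exceed `hedge_bound(loss_k, n, beta)` for any expert $k$.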


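Proof. [FS97] Let $W^t$ be the total weight $\sum_i w_i^t$ at time $t$, and let $L^t$ be the expected loss to Hedge at time $t$. Note that

\[ L^t = \sum_i \frac{w_i^{t-1}}{W^{t-1}}\,\ell_i^t . \]

As in the proof of Theorem 3.1, we bound $W^{\text{final}}$. We bound $W^t$ in terms of $W^{t-1}$.

\begin{align*}
W^t &= \sum_i w_i^t = \sum_i w_i^{t-1}\beta^{\ell_i^t} \\
    &\le \sum_i w_i^{t-1}\left(1 - (1 - \beta)\ell_i^t\right) = W^{t-1}\left(1 - (1 - \beta)L^t\right) \\
    &\le W^{t-1} e^{-(1-\beta)L^t}
\end{align*}

We can now bound $W^{\text{final}}$.

\[ W^{\text{final}} \le W^{\text{init}} \prod_t e^{-(1-\beta)L^t} = n\, e^{-(1-\beta)\sum_t L^t} \]

For the lower bound on $W^{\text{final}}$, we know it is at least $w_k^{\text{final}}$, which is exactly $\beta^{loss_k}$. Thus we have the inequality

\[ \beta^{loss_k} \le n\, e^{-(1-\beta)\sum_t L^t} , \]

which we solve for $E[loss_{\mathrm{Hedge}}] = \sum_t L^t$. ■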


3.3  Partitioning  bound 

Until  this  point,  we  have  contented  ourselves  with  bounding  performance  against  the  best  single  expert  over 
all  time  steps.  The  partitioning  bound  is  a  more  ambitious  goal.  Here  we  try  to  do  well  against  all  partitions 
of  time  into  intervals,  where  we  pick  the  best  expert  within  each  time  interval  of  the  partition.  Being  able  to 
do  well  against  all  partitions  includes,  for  example,  scenarios  where  one  expert  does  very  well  for  the  first 
half  of  time,  whereas  another  expert  does  best  on  the  last  half  of  time.  For  a  good  partitioning  bound,  an 
algorithm  must  adapt  particularly  quickly  to  changed  expert  performance. 

Formally, given a partition $P$ of time into intervals, let $k_P$ be the number of intervals. We let $L_P^j$ be the loss of the best expert within the $j$th interval, and we let $L_P$ be the total loss over all intervals, $\sum_{j=1}^{k_P} L_P^j$. The partitioning bound of algorithm $A$ will be some bound on its expected loss of the form

\[ E[loss_A] \le a L_P + b k_P \]


for  some  coefficients  a  and  b. 
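
To make the comparison class concrete, the following sketch computes the partition loss for a given partition. It is an illustration under assumptions made here: a partition is represented as a sorted list of interval start indices beginning with 0, and the function name is hypothetical.

```python
def partition_loss(loss_vectors, boundaries):
    """Compute (L_P, k_P) for a partition of time into intervals.

    `loss_vectors[t][i]` is the loss of expert i at time t.
    `boundaries` lists the start index of each interval, e.g. [0, 3] splits
    time into [0, 3) and [3, T). For each interval we take the loss of the
    best single expert on that interval and add these up.
    """
    n = len(loss_vectors[0])
    ends = boundaries[1:] + [len(loss_vectors)]
    total = 0
    for start, end in zip(boundaries, ends):
        total += min(sum(loss_vectors[t][i] for t in range(start, end))
                     for i in range(n))
    return total, len(boundaries)
```

For example, if expert 0 is perfect on the first half of time and expert 1 is perfect on the second half, the single-interval partition has loss equal to half the time steps, while the two-interval partition has loss 0 at the price of $k_P = 2$.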

We hope to find a generalized bound similar to Theorem 3.2's bound for Hedge, a bound of the form

\[ E[loss_A] \le (1 + \varepsilon)L_P + \frac{1}{\varepsilon} k_P \ln n . \]

We examine two variants of Hedge, Thresh and Share, that achieve this type of bound. In Section 7.2, we see another variant called Phased-Hedge.

Thresh


The  first  of  these  algorithms,  Thresh,  is  an  adaptation  of  Littlestone  and  Warmuth’s  WML  algorithm  to  the 
Experts  problem  [LW94]. 

Algorithm Thresh We use parameters $\beta \in (0, 1)$ and $\alpha \in (0, \frac{1}{2}]$ and maintain a weight $w_i$ for each expert, initially $w_i^0 = 1$. At time step $t$, we compute the total weight $W^{t-1} = \sum_i w_i^{t-1}$, and we let $S^{t-1}$ be the set of experts $i$ with $w_i^{t-1} \ge \frac{\alpha}{n} W^{t-1}$. Define $\tilde{W}^{t-1}$ as the total weight in $S^{t-1}$, $\sum_{i \in S^{t-1}} w_i^{t-1}$. We choose expert $i$ with probability $w_i^{t-1}/\tilde{W}^{t-1}$ if $i \in S^{t-1}$ and with probability 0 otherwise. Given the loss vector $\ell^t$, for each expert $i \in S^{t-1}$, we update its weight to become $w_i^t \leftarrow w_i^{t-1}\beta^{\ell_i^t}$; we do not change weights for $i \notin S^{t-1}$.
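
One step of Thresh can be sketched as follows. This is a hedged illustration, not the original author's code: it assumes the threshold is $\frac{\alpha}{n}$ times the total weight (the reading under which the thresholded weight is at least $(1 - \alpha)$ of the total, as the proof uses), and the function name and return convention are hypothetical.

```python
def thresh_step(weights, losses, alpha, beta):
    """One time step of Thresh.

    Experts whose weight is below (alpha / n) * total are inactive: they get
    probability 0 and their weights are frozen. Active experts are chosen in
    proportion to their weight and receive the usual multiplicative update.
    Returns (probabilities used at this step, updated weights).
    """
    n = len(weights)
    total = sum(weights)
    active = [w >= alpha * total / n for w in weights]
    active_total = sum(w for w, a in zip(weights, active) if a)
    probs = [w / active_total if a else 0.0
             for w, a in zip(weights, active)]
    new_weights = [w * beta ** l if a else w
                   for w, l, a in zip(weights, losses, active)]
    return probs, new_weights
```

Since the heaviest expert always has weight at least the average, and $\alpha \le 1$, the active set is never empty and the division is safe.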

Theorem 3.3 Given $n$ experts, Thresh incurs expected loss at most

\[ \left(\frac{\ln(1/\beta)}{(1-\beta)(1-\alpha)}\right) L_P + \left(\frac{\ln(n/\beta\alpha)}{(1-\beta)(1-\alpha)}\right) k_P \]

for any partition $P$.

Remark. For small $\varepsilon$, let $\alpha = \frac{1}{n}$ and $\beta = 1 - 2\varepsilon$. As $n$ becomes very large, the bound of Theorem 3.3 translates to approximately

\[ (1 + \varepsilon)L_P + \left(1 + \varepsilon + \frac{1}{\varepsilon}\ln n\right) k_P . \]

If we restrict our attention to $k_P = 1$ (the case considered in Theorem 3.2), we see that this effectively generalizes the bound of Hedge, at the loss of only a factor of 2 in the coefficient to $\ln n$.


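Proof. [LW94] Note that $\tilde{W}^t \ge (1 - \alpha)W^t$ for all $t$, and let $L^t$ be the expected loss to Thresh at time $t$.

\[ L^t = \sum_{i \in S^{t-1}} \frac{w_i^{t-1}}{\tilde{W}^{t-1}}\,\ell_i^t \]

As in Theorem 3.2's proof, we bound how a single step alters the total weight.

\begin{align*}
W^t &= \sum_{i \in S^{t-1}} \beta^{\ell_i^t} w_i^{t-1} + \sum_{i \notin S^{t-1}} w_i^{t-1} \\
    &\le \sum_{i \in S^{t-1}} \left(1 - (1-\beta)\ell_i^t\right) w_i^{t-1} + \sum_{i \notin S^{t-1}} w_i^{t-1} \\
    &= W^{t-1}\left(1 - (1-\beta)\sum_{i \in S^{t-1}} \frac{w_i^{t-1}}{W^{t-1}}\,\ell_i^t\right) \\
    &\le W^{t-1}\left(1 - (1-\beta)(1-\alpha)\sum_{i \in S^{t-1}} \frac{w_i^{t-1}}{\tilde{W}^{t-1}}\,\ell_i^t\right) \\
    &= W^{t-1}\left(1 - (1-\beta)(1-\alpha)L^t\right) \tag{3.1}
\end{align*}

Consider any partition $P$, and examine segment $j$ of the partition, where the best expert (call it $k$) incurs loss $L_P^j$. Say that the total weight at the segment's beginning is $W^{\text{init}}$ and the total weight at the segment's end is $W^{\text{final}}$. Because Thresh never allows a weight to fall below $\frac{\beta\alpha}{n}W^t$, the initial weight of expert $k$ in the segment is at least $\frac{\beta\alpha}{n}W^{\text{init}}$. Thus at the segment's end, expert $k$'s weight, and hence $W^{\text{final}}$, is at least $\frac{\beta\alpha}{n}\beta^{L_P^j}W^{\text{init}}$. Applying bound (3.1), we have

\[ \frac{\beta\alpha}{n}\beta^{L_P^j}W^{\text{init}} \le W^{\text{final}} \le W^{\text{init}}\prod_t \left(1 - (1-\beta)(1-\alpha)L^t\right) . \]

So we have

\[ L_P^j \ln\beta + \ln\frac{\beta\alpha}{n} \le -(1-\beta)(1-\alpha)\sum_t L^t , \]

which gives us a bound on the segment's expected loss of

\[ \sum_t L^t \le \frac{\ln(1/\beta)}{(1-\beta)(1-\alpha)} L_P^j + \frac{\ln(n/\beta\alpha)}{(1-\beta)(1-\alpha)} . \]

Summing over segments, we get the desired bound. ■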


Share 


We also examine Share, an alternative to Thresh. This is an adaptation of Herbster and Warmuth's Variable-Share algorithm to the Experts environment [HW98].

Algorithm Share We use parameters $\beta \in (0, 1)$ and $\alpha \in (0, \frac{1}{2}]$, and maintain a weight $w_i$ for each expert, initially $w_i^0 = 1$. At time step $t$, we choose expert $i$ with probability proportional to its weight, $w_i^{t-1} / \sum_j w_j^{t-1}$. Given the loss vector, we update the weight of each expert to become

\[ w_i^t \leftarrow w_i^{t-1}\beta^{\ell_i^t} + \alpha\Delta^t/n , \quad \text{where } \Delta^t \text{ is } \sum_i \left(w_i^{t-1} - w_i^{t-1}\beta^{\ell_i^t}\right) . \]

The update rule used by this algorithm can be viewed as follows. We first update as usual: $w_i \leftarrow w_i\beta^{\ell_i^t}$. This reduces the sum of the weights by some amount $\Delta^t$. We then distribute an $\alpha$ fraction of this $\Delta^t$ evenly among the $n$ experts ($\alpha\Delta^t/n$ each).
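
The two-phase view of the update (decay, then redistribute a fraction of what was removed) translates directly into code. The sketch below is illustrative; the function name and return convention are assumptions made here.

```python
def share_step(weights, losses, alpha, beta):
    """One time step of Share.

    First apply the usual multiplicative update, which removes a total of
    `delta` from the weights. Then hand an alpha fraction of `delta` back,
    split evenly among the n experts.
    Returns (probabilities used at this step, updated weights).
    """
    n = len(weights)
    total = sum(weights)
    probs = [w / total for w in weights]
    decayed = [w * beta ** l for w, l in zip(weights, losses)]
    delta = total - sum(decayed)
    new_weights = [w + alpha * delta / n for w in decayed]
    return probs, new_weights
```

After one step the total weight drops by exactly $(1 - \alpha)\Delta^t$, which is the identity the analysis of Share relies on.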

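Theorem 3.4 Given $n$ experts, Share incurs expected loss at most

\[ \left(\frac{\ln(1/\beta)}{(1-\beta)(1-\alpha)}\right) L_P + \left(\frac{\ln(n/\alpha)}{(1-\beta)(1-\alpha)}\right) k_P \]

for any partition $P$.

Remark. For small $\varepsilon$, let $\alpha = \frac{1}{n}$ and $\beta = 1 - 2\varepsilon$. As $n$ becomes very large, the bound of Theorem 3.4 translates to approximately

\[ (1 + \varepsilon)L_P + \frac{1}{\varepsilon}\ln n \; k_P . \]

That is, we get about the same tradeoff we saw with Hedge and Thresh.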

Proof. Given a partition $P$, we consider segment $j$ of the partition. Let $L^t$ be the expected loss to Share at time step $t$ within the segment. Say expert $k$ is the best expert of the segment (with loss $L_P^j$). Our goal is to show that the algorithm's expected loss $\sum_t L^t$ is at most

\[ \frac{\ln(1/\beta)L_P^j + \ln(n/\alpha)}{(1-\beta)(1-\alpha)} . \tag{3.2} \]

Such a bound, summed over segments, implies the theorem's bound.

Using the typical multiplicative-update analysis (Theorem 3.2) we get

\[ W^t \le W^{t-1}\left(1 - (1-\beta)(1-\alpha)L^t\right) . \]

So, if $W^{\text{init}}$ is the sum of weights at the segment's beginning and $W^{\text{final}}$ is the sum of weights at the segment's end, then $W^{\text{final}}$ is bounded by

\[ W^{\text{final}} \le W^{\text{init}}\prod_t \left(1 - (1-\beta)(1-\alpha)L^t\right) . \tag{3.3} \]

Now consider the weight of expert $k$. At time $t$, we have $W^t = W^{t-1} - \Delta^t + \alpha\Delta^t$, and so $\Delta^t$ is $\frac{1}{1-\alpha}(W^{t-1} - W^t)$. Thus the amount added to $w_k^t$ due to the share update is $\frac{\alpha}{(1-\alpha)n}(W^{t-1} - W^t)$. In the entire segment, therefore, the total amount added to $w_k$ due to the share updates is $\frac{\alpha}{(1-\alpha)n}(W^{\text{init}} - W^{\text{final}})$. Thus, even if $w_k^{\text{init}}$ is zero, by the end of the segment we have

\[ w_k^{\text{final}} \ge \beta^{L_P^j}\,\frac{\alpha}{(1-\alpha)n}\left(W^{\text{init}} - W^{\text{final}}\right) , \tag{3.4} \]

since the worst case for $w_k^{\text{final}}$ is if the penalties for the expert's losses come after the sharing.

For convenience, define

\[ \Pi = \prod_t \left(1 - (1-\beta)(1-\alpha)L^t\right) . \]

Combining (3.3) and (3.4) we get

\[ \beta^{L_P^j}\,\frac{\alpha}{(1-\alpha)n}\left(W^{\text{init}} - W^{\text{final}}\right) \le w_k^{\text{final}} \le W^{\text{final}} \le W^{\text{init}}\,\Pi . \]

We can now solve for $\Pi$. Since $W^{\text{init}} - W^{\text{final}} \ge W^{\text{init}}(1 - \Pi)$, this gives us

\[ \Pi \ge \left(1 + \frac{(1-\alpha)n}{\beta^{L_P^j}\alpha}\right)^{-1} = \frac{\beta^{L_P^j}\alpha}{(1-\alpha)n + \beta^{L_P^j}\alpha} , \]

\[ -\ln\Pi \le \ln\left(\frac{1}{\beta}\right)L_P^j + \ln\frac{n}{\alpha} . \]

Recalling the definition of $\Pi$, we notice that

\[ -\ln\Pi \ge (1-\beta)(1-\alpha)\sum_t L^t , \]

so

\[ \sum_t L^t \le \frac{\ln(1/\beta)L_P^j + \ln(n/\alpha)}{(1-\beta)(1-\alpha)} \]

as we desired in (3.2).


3.4  Translating  to  MTS 

The  Experts  and  MTS  problems  have  deep  similarities:  The  experts  correspond  closely  to  MTS  states,  and 
the  loss  vectors  correspond  closely  to  task  vectors.  This  gives  us  some  hope  that  Thresh  and  Share  can 
also  be  used  as  MTS  algorithms.  But  there  are  some  important  differences  between  the  problems. 

•  The  MTS  problem  includes  a  cost  for  switching  between  states/experts. 

•  An  MTS  algorithm  has  one-step  lookahead.  That  is,  first  the  cost  vector  is  announced,  then  the 
algorithm  chooses  whether  to  move,  and  finally  the  algorithm  pays  according  to  the  entry  in  the  cost 
vector  for  the  new  state.  In  contrast,  the  Experts  algorithm  has  zero  lookahead ,  in  that  it  first  pays 
and  then  moves. 

•  Because  of  the  lookahead,  MTS  algorithms  can  deal  with  unbounded  cost  vectors.  Large  losses  are 
actually  advantageous  to  an  on-line  MTS  algorithm  in  that  they  are  essentially  equivalent  to  allowing 
the  algorithm  to  “see  further  into  the  future.”  That  is,  an  adversary  trying  to  defeat  an  MTS  algorithm 
might  as  well  use  several  small  task  vectors  instead  of  a  single  large  task  vector,  so  that  the  algorithm 
is  not  sure  which  state  is  best.  (Theorem  4.1  formalizes  this  observation.) 

•  The Experts goal of doing well with respect to the best expert is a much weaker goal than the competitive-ratio goal of doing well against all sequences. Of course, because the goal is weak, the Experts bounds are very good ($1 + \varepsilon$ times the best expert), whereas the MTS bounds are relatively poor ($O(\log n)$).

In  this  section  we  examine  how  our  two  Experts  algorithms  do  in  the  unfair  uniform-metric  MTS 
problem.  Later  (Chapter  6)  we  look  at  the  other  direction  —  how  MTS  algorithms  apply  to  the  Experts 
scenario. 

Thresh 

Thresh, unfortunately, does not translate well in the unfair MTS setting. In fact, Thresh does not have a bounded ratio at all. Consider the two-expert case. Say that expert 2 incurs a loss large enough for its weight to drop to slightly below $\frac{\alpha}{2}W$. At this point, the algorithm has all probability on expert 1. Now suppose expert 1 incurs a tiny loss, just sufficient to bring $w_2$ to equal $\frac{\alpha}{2}W$. (Again, $W$ stands for the total weight $\sum_i w_i$.) This forces the algorithm to move $\frac{\alpha}{2}$ probability over to expert 2. Now suppose expert 2 incurs an infinitesimal loss so that $w_2 < \frac{\alpha}{2}W$. This forces the algorithm to move $\frac{\alpha}{2}$ probability back to expert 1. This situation can repeat indefinitely, causing the algorithm to incur unbounded movement cost with insignificant increase in the off-line optimal cost, giving an unbounded competitive ratio.

Share 

The problem with Thresh is that it does not control its movement costs very smoothly. Share, however, does. In fact, we can show that it is good as a uniform-metric MTS algorithm. The bound for the MTS setting is exactly what we want from our discussion closing Section 2.3. (A new $\log r$ term appears, but this is not problematic since we can assume $r = O(n)$; if the ratio is higher, we can simply apply Work-Function to get the same guarantee.)

Theorem 3.5 We use Share for the $r$-unfair uniform-metric MTS setting as follows: Given a task vector $T^t$, we give $rT^t$ to Share and use the resulting probability distribution to choose a state. Given any $\gamma \ge 2$, we can configure $\alpha$ and $\beta$ in Share so that its $r$-unfair competitive ratio is

\[ \rho = r + 3.2\gamma\ln(n(r+1)) + 4 \]

with an additive $\rho/\gamma$.

Remark. In the proof, we choose $\alpha$ to be $(r+1)^{-1}$. For $\beta$, we choose it to be $\left(1 + \frac{\gamma}{r}\ln\frac{n}{\alpha}\right)^{-1}$ if $r \ge \gamma\ln(n(r+1))$ and $\frac{1}{e}$ otherwise.
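
The parameter choice can be written out and checked numerically. The sketch below is an illustration under stated assumptions: it takes $\alpha = (r+1)^{-1}$, takes $\beta = \left(1 + \frac{\gamma}{r}\ln\frac{n}{\alpha}\right)^{-1}$ when $r \ge \gamma\ln(n(r+1))$ and $\beta = 1/e$ otherwise, and evaluates the multiplicative coefficient $\frac{1+\ln(1/\beta)}{(1-\beta)(1-\alpha)}\max\{r\ln\frac{1}{\beta}, \ln\frac{n}{\alpha}\}$ that the proof bounds. The function names are hypothetical.

```python
import math


def share_mts_parameters(n, r, gamma):
    """Choose alpha and beta for Share in the r-unfair uniform MTS setting."""
    alpha = 1.0 / (r + 1)
    log_term = math.log(n / alpha)  # equals ln(n (r + 1))
    if r >= gamma * log_term:
        beta = 1.0 / (1.0 + (gamma / r) * log_term)
    else:
        beta = 1.0 / math.e
    return alpha, beta


def share_mts_ratio(n, r, alpha, beta):
    """Multiplicative coefficient appearing in the proof's cost bound."""
    log_inv_beta = math.log(1.0 / beta)
    scale = (1 + log_inv_beta) / ((1 - beta) * (1 - alpha))
    return scale * max(r * log_inv_beta, math.log(n / alpha))


def share_mts_target(n, r, gamma):
    """The claimed ratio r + 3.2 * gamma * ln(n (r + 1)) + 4."""
    return r + 3.2 * gamma * math.log(n * (r + 1)) + 4
```

For instance, with $n = 4$, $r = 100$, and $\gamma = 2$, the first case applies and the computed coefficient stays below the target value.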

Proof. Consider any off-line strategy $v$. This corresponds to a partition $P_v$ with $move(v) + 1$ segments. The loss $L_P$ of the partition is $local(T, v)$. We consider the local cost and the movement cost incurred by Share in turn. Theorem 3.4 shows that the task-processing cost satisfies

\begin{align*}
E[r\,local(T, v_A)] &= E[local(rT, v_A)] \\
  &\le \frac{\ln(1/\beta)}{(1-\beta)(1-\alpha)}\,local(rT, v) + \frac{\ln(n/\alpha)}{(1-\beta)(1-\alpha)}\,(1 + move(v)) . \tag{3.5}
\end{align*}

(In fact, the MTS problem allows one-step lookahead; this only decreases the algorithm's cost.)

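To analyze the movement cost, note that the total weight $W$ only decreases with time. We show that for any time step $t$, the movement cost is at most $\ln(1/\beta)$ times the local cost. Writing $p_i^t = w_i^t/W^t$ and $\ell^t = rT^t$ for the loss vector given to Share,

\begin{align*}
\sum_{i:\,p_i^{t-1} > p_i^t} \left(p_i^{t-1} - p_i^t\right)
  &= \sum_{i:\,p_i^{t-1} > p_i^t} \left(\frac{w_i^{t-1}}{W^{t-1}} - \frac{w_i^{t-1}\beta^{\ell_i^t} + \alpha\Delta^t/n}{W^t}\right) \\
  &\le \sum_{i:\,p_i^{t-1} > p_i^t} \left(\frac{w_i^{t-1}}{W^{t-1}} - \frac{w_i^{t-1}\beta^{\ell_i^t}}{W^{t-1}}\right) \\
  &= \sum_{i:\,p_i^{t-1} > p_i^t} \frac{w_i^{t-1}}{W^{t-1}}\left(1 - \beta^{\ell_i^t}\right) \\
  &\le \sum_i p_i^{t-1}\,\ell_i^t \ln\left(\frac{1}{\beta}\right)
\end{align*}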


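Thus the total $r$-unfair cost to Share is at most

\begin{align*}
&\left(1 + \ln\frac{1}{\beta}\right)\left(\frac{\ln(1/\beta)}{(1-\beta)(1-\alpha)}\,r\,local(T, v) + \frac{\ln(n/\alpha)}{(1-\beta)(1-\alpha)}\,(1 + move(v))\right) \\
&\quad\le \frac{1 + \ln(1/\beta)}{(1-\beta)(1-\alpha)}\max\left\{r\ln\frac{1}{\beta}, \ln\frac{n}{\alpha}\right\}\left(local(T, v) + move(v)\right) + \frac{1 + \ln(1/\beta)}{(1-\beta)(1-\alpha)}\ln\frac{n}{\alpha} .
\end{align*}

We must choose the values of $\alpha$ and $\beta$ appropriately.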

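If $r \ge \gamma\ln(n(r+1))$, we choose $\alpha = (r+1)^{-1}$ and $\beta = \left(1 + \frac{\gamma}{r}\ln\frac{n}{\alpha}\right)^{-1}$. Since $\ln\frac{1}{\beta} \le \frac{\gamma}{r}\ln\frac{n}{\alpha}$ and $(1-\beta)^{-1} = 1 + r/(\gamma\ln\frac{n}{\alpha})$, the competitive ratio is

\begin{align*}
\frac{1 + \ln(1/\beta)}{(1-\beta)(1-\alpha)}\max\left\{r\ln\frac{1}{\beta}, \ln\frac{n}{\alpha}\right\}
  &\le \frac{1 + \frac{\gamma}{r}\ln(n/\alpha)}{1-\alpha}\left(1 + \frac{r}{\gamma\ln\frac{n}{\alpha}}\right)\max\left\{\gamma\ln\frac{n}{\alpha}, \ln\frac{n}{\alpha}\right\} \\
  &= \frac{1}{1-\alpha}\left(r + 2\gamma\ln\frac{n}{\alpha} + \frac{\left(\gamma\ln\frac{n}{\alpha}\right)^2}{r}\right) .
\end{align*}

We continue, using the fact that $r \ge \gamma\ln(n(r+1))$.

\begin{align*}
\frac{1}{1-\alpha}\left(r + 2\gamma\ln\frac{n}{\alpha} + \frac{\left(\gamma\ln\frac{n}{\alpha}\right)^2}{r}\right)
  &\le \left(1 + \frac{1}{r}\right)\left(r + 3\gamma\ln(n(r+1))\right) \\
  &\le r + 3\gamma\ln(n(r+1)) + 4
\end{align*}

The additive part is identical to this derivation, except that $\max\{\cdots\}$ is replaced by $\ln\frac{n}{\alpha}$, a factor of $\gamma$ less.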

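If $r < \gamma\ln(n(r+1))$, then we choose $\alpha = (r+1)^{-1}$ and $\beta = \frac{1}{e}$. The competitive ratio, then, is

\begin{align*}
\frac{1 + \ln(1/\beta)}{(1-\beta)(1-\alpha)}\max\left\{r\ln\frac{1}{\beta}, \ln\frac{n}{\alpha}\right\}
  &\le \frac{2}{1 - 1/e}\left(1 + \frac{1}{r}\right)\max\left\{r, \ln(n(r+1))\right\} \\
  &= \frac{2}{1 - 1/e}\max\left\{r + 1, \left(1 + \frac{1}{r}\right)\ln(n(r+1))\right\} .
\end{align*}

We can continue, using the facts that $r < \gamma\ln(n(r+1))$ and $r \ge 1$.

\begin{align*}
\frac{2}{1 - 1/e}\max\left\{r + 1, \left(1 + \frac{1}{r}\right)\ln(n(r+1))\right\}
  &\le \frac{2}{1 - 1/e}\max\left\{\gamma\ln(n(r+1)) + 1, 2\ln(n(r+1))\right\} \\
  &= \frac{2}{1 - 1/e}\left(\gamma\ln(n(r+1)) + 1\right) \\
  &\le 3.2\gamma\ln(n(r+1)) + 3.2
\end{align*}

The additive part is identical except that the $\max\{\cdots\}$ is replaced by $\ln\frac{n}{\alpha}$, a factor of $\gamma$ less. ■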

Thus, using Share, we can achieve our $poly(L, \log n)$ ratio for $L$-depth HSTs. But we reach our $O(\log^5 n \log\log n)$ bound using a different unfair MTS algorithm called Odd-Exponent. We turn to examining Odd-Exponent and using it to build an MTS algorithm with a polylog($n$) competitive ratio.

Chapter 4

A  general-metric  MTS  algorithm 


This chapter presents the polylog($n$)-competitive algorithm for metrical task systems. We begin by examining a different algorithm Odd-Exponent for the $r$-unfair uniform MTS problem. Interestingly, although Odd-Exponent and Share are radically different in approach, they share similar guarantees. Share is the simpler and more intuitive algorithm, but Odd-Exponent is an interesting alternative with slightly more efficient MTS guarantees. In particular, with Odd-Exponent we can guarantee an $O(\log^5 n \log\log n)$
competitive  ratio  on  general  metric  spaces,  whereas  using  Share  gives  us  instead  O  (log '  n  log  log  n).  (The 
difference  is  that  Odd-Exponent  has  a  smaller  additive  part  in  its  guarantee.) 

4.1  Linear 

For intuition, we first consider what we should do for two regions. One very good strategy (in fact, the optimal $r$-unfair strategy) is to allocate to region 1 the probability

\[ p_1 = \frac{1}{2} + \frac{\mathrm{OPT}_2 - \mathrm{OPT}_1}{2} , \]

and to region 2 the remainder. This is the strategy that Blum et al. use for equal-ratio regions [BKRS92]. Its $r$-unfair competitive ratio is $r + 1$; the derivation, analysis, and proof of optimality are identical to the approach we later see in Theorem 4.7.

For more than 2 regions, the natural approach is to generalize the 2-region equation. We call this algorithm Linear to emphasize the linear movement of probability as the work function changes.

Algorithm Linear We allocate to region $j$ the probability

\[ p_j = \frac{1}{n} + \frac{1}{n}\sum_i \left(\mathrm{OPT}_i - \mathrm{OPT}_j\right) . \]
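
The allocation rule of Linear is easy to evaluate directly. The sketch below is illustrative only: the function name is hypothetical, and it assumes the work-function values passed in are such that every resulting probability is nonnegative (the assumptions that guarantee this are discussed in the text that follows).

```python
def linear_distribution(opt):
    """Probability assigned to each region by Linear.

    `opt[i]` is the work-function value OPT_i of region i. Region j receives
    1/n plus 1/n times the sum over i of (OPT_i - OPT_j), so regions whose
    work function lags behind the others receive more probability.
    """
    n = len(opt)
    return [1 / n + sum(o_i - o_j for o_i in opt) / n for o_j in opt]
```

For two regions this reduces to the two-region rule: with work-function values 0 and 0.5 the first region receives probability 0.75 and the second 0.25.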

The  following  analysis  of  Linear  is  simpler  than  the  later  Odd-Exponent  analysis,  but  it  follows  the  same 
basic  method. 

To simplify our analysis of these algorithms, we employ two assumptions. The first is to assume that each task vector is 0 in all components, except one component which is bounded by $\delta$. We can choose $\delta$ to be as small as we want. Such a task is called an elementary task or a $\delta$-elementary task. The following theorem, not proven here, justifies this assumption.

Theorem 4.1 ([Tom97, BEY98]) For any metric space and any $\delta > 0$, if we have a $\rho$-competitive MTS algorithm assuming $\delta$-elementary task vectors, then we can construct a $\rho$-competitive MTS algorithm.

We use the notation $(j, \delta)$ to represent a task where $j$ is the state incurring a cost of $\delta$.

Our  second  assumption  is  the  following. 

Assumption 4.1 For an elementary task giving a cost of $\delta$ to a state $v$ so that all probability on $v$ is removed, we can assume that $\delta$ is the least value causing the algorithm to do this.

This is because a larger $\delta$ does not alter the on-line cost, although it may increase the off-line cost. The end of Section 6.3 (which presents results of an empirical comparison of several unfair MTS algorithms including Odd-Exponent and Share) discusses how an implementation can efficiently incorporate these assumptions.

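Theorem 4.2 The $r$-unfair competitive ratio of Linear is at most $r + (n - 1)$.

Proof. We use a potential function

\[ \Phi = \frac{r}{2n}\sum_{i,j:\,i<j}\left(\mathrm{OPT}_i - \mathrm{OPT}_j\right)^2 , \]

and our analysis competes against the average work-function value, $\frac{1}{n}\sum_i \mathrm{OPT}_i$, which is at most 1 from the true optimum, $\min_i \mathrm{OPT}_i$.

Say we receive an elementary task vector where only a state $k$ incurs a cost $\delta$. Let $p_k$ and $p_k'$ represent the probability in region $k$ before and after the task vector, and let $\Phi$ and $\Phi'$ represent the potential before and after. Then the on-line strategy's amortized cost is

\[ p_k' r\delta + (p_k - p_k') + \Phi' - \Phi . \]

Assumption 4.1 implies that $\mathrm{OPT}_k$ will rise by exactly $\delta$, say from $y$ to $y + \delta$. Because $p_k$ decreases as a function of $\mathrm{OPT}_k$, we can upper-bound this cost using an integral.

\[ \int_y^{y+\delta}\left(p_k r - \frac{\partial p_k}{\partial\mathrm{OPT}_k} + \frac{\partial\Phi}{\partial\mathrm{OPT}_k}\right) d\mathrm{OPT}_k \]

We compute the integrand.

\begin{align*}
p_k r - \frac{\partial p_k}{\partial\mathrm{OPT}_k} + \frac{\partial\Phi}{\partial\mathrm{OPT}_k}
  &= \frac{r}{n} + \frac{r}{n}\sum_{i \ne k}\left(\mathrm{OPT}_i - \mathrm{OPT}_k\right) - \frac{1}{n}\sum_{i \ne k}(-1) + \frac{r}{n}\sum_{i \ne k}\left(\mathrm{OPT}_k - \mathrm{OPT}_i\right) \\
  &= \frac{r + n - 1}{n}
\end{align*}

Thus the total cost is

\[ \int_y^{y+\delta}\frac{r + n - 1}{n}\, d\mathrm{OPT}_k = \frac{\delta}{n}(r + n - 1) , \]

which is $r + (n - 1)$ times the change in $\frac{1}{n}\sum_i \mathrm{OPT}_i$ of $\frac{\delta}{n}$. ■

4.2  Odd-Exponent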


Although $r + (n - 1)$ is an interesting alternative to the $(r + 1)H_n$ guarantee of Marking, it falls short of what we need. By adding a parameter $t$ to Linear in a peculiar way, it turns out that we get the best of both worlds.

Before discussing the strategy, we first define the odd exponent function, notated $x^{[t]}$ for any $x \in \mathbb{R}$ and $t > 0$.

\[ x^{[t]} = \begin{cases} x^t & \text{if } x \ge 0 \\ -(-x)^t & \text{if } x < 0 \end{cases} \]

In our analysis, we use the relationship in the derivatives of $x^{[t]}$ and $|x|^t$ (which we could term the even exponent function) for $t \ge 1$.

\[ \frac{d}{dx}|x|^t = t\,x^{[t-1]} \qquad \frac{d}{dx}x^{[t]} = t\,|x|^{t-1} \]

Note also that

\[ x^{[t]} + |x|^t = \begin{cases} 2x^t & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \tag{4.1} \]


Algorithm  Odd-Exponent  The  strategy  uses  a  parameter  t  >  1.  (Think  t  =  O(logn).)  We 
allocate  to  region  j  the  probability 


ft  =  ;  +  ^EtOPT>-OPTJ»l‘1  • 

i=  1 


(4.2) 


Lemma 4.3  Odd-Exponent maintains legal probability distributions ($\sum_j p_j = 1$ and each $p_j$ is nonnegative).

Proof. It maintains $\sum_j p_j = 1$ because $x^{[t]}$ is an odd function, so $\sum_i \sum_j (\mathrm{OPT}_i - \mathrm{OPT}_j)^{[t]} = 0$. Because $p_j$ is a decreasing function of only $\mathrm{OPT}_j$ among the OPT values, Assumption 4.1 implies that each $p_j$ remains nonnegative. (Requests to $i \ne j$ only increase $p_j$. Say we receive a request $(j, \delta)$ that would make $p_j$ negative if $\mathrm{OPT}_j$ increased by $\delta$. Since the distribution (4.2) is continuous, there is a $\delta' < \delta$ for which the algorithm sets $p_j$ to zero. Assumption 4.1 implies that we can use $(j, \delta')$ instead so that $p_j$ becomes exactly zero.) ■
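As a small illustration of Lemma 4.3, the following Python sketch computes the odd exponent function and the allocation $p_j = \frac{1}{n} + \frac{1}{n}\sum_i (\mathrm{OPT}_i - \mathrm{OPT}_j)^{[t]}$. The function names and the sample work-function values are hypothetical, chosen only so that no probability would go negative; this is a sketch, not the dissertation's implementation.

```python
def odd_pow(x: float, t: float) -> float:
    """Odd exponent function: x^t for x >= 0, and -(-x)^t for x < 0."""
    return x ** t if x >= 0 else -((-x) ** t)


def odd_exponent_probs(opt: list[float], t: float) -> list[float]:
    """Allocate p_j = 1/n + (1/n) * sum_i odd_pow(OPT_i - OPT_j, t)."""
    n = len(opt)
    return [1.0 / n + sum(odd_pow(oi - oj, t) for oi in opt) / n for oj in opt]


# Hypothetical work-function values (assumption: spread small enough
# that the region with the largest OPT value keeps nonnegative mass).
example_opt = [0.0, 0.3, 0.5, 0.8]
example_probs = odd_exponent_probs(example_opt, t=3.0)
```

Because the pairwise terms cancel, the probabilities sum to one, and a region with a larger OPT value receives less probability.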

In the remainder of this section we analyze the strategy's $r$-unfair competitive ratio and then its additive part.

To  analyze  the  performance  we  require  a  simple  general  lemma. 

Lemma 4.4  Consider $n$ nonnegative reals $x_1, \ldots, x_n$ and two numbers $1 \le s \le t$. If $\sum_i x_i^t \le 1$, then $\sum_i x_i^s \le n^{1 - s/t}$.

This lemma, presented here without proof, is not difficult to understand: the value of $\sum_i x_i^s$ is maximized when all the terms are equal.
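The lemma can be spot-checked numerically. The sketch below reads the lemma as: if $\sum_i x_i^t \le 1$ then $\sum_i x_i^s \le n^{1 - s/t}$ (this reading is an assumption, since the statement is stated here without proof). The helper names are hypothetical.

```python
import random


def lemma_4_4_holds(xs, s, t, tol=1e-9):
    """Check sum(x^s) <= n^(1 - s/t) for a vector with sum(x^t) <= 1."""
    n = len(xs)
    assert sum(x ** t for x in xs) <= 1 + tol, "input violates the hypothesis"
    return sum(x ** s for x in xs) <= n ** (1 - s / t) + tol


def random_feasible(n, t, rng):
    """Draw a nonnegative vector whose t-th powers sum to at most 1."""
    raw = [rng.uniform(0.1, 1.0) for _ in range(n)]
    norm = sum(x ** t for x in raw) ** (1 / t)
    scale = rng.random()  # shrink by a random factor in [0, 1)
    return [scale * x / norm for x in raw]
```

The bound is tight when every $x_i$ equals $n^{-1/t}$, which is the equal-terms case mentioned above.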

Theorem 4.5  The $r$-unfair competitive ratio of Odd-Exponent is at most $r + 2n^{1/t}t$.

Remark. If we choose $t$ to be $\ln n$, this ratio translates to $r + 2e \ln n$.
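The arithmetic behind the remark is that $n^{1/\ln n} = e$, so the movement term $2n^{1/t}t$ becomes $2e \ln n$ at $t = \ln n$, which is also where that term is smallest. A short Python check (the function name is hypothetical):

```python
import math


def movement_term(n: int, t: float) -> float:
    """The term 2 * n^(1/t) * t added to r in the ratio of Theorem 4.5."""
    return 2.0 * n ** (1.0 / t) * t


n_regions = 1024
t_star = math.log(n_regions)          # the choice t = ln n from the remark
at_t_star = movement_term(n_regions, t_star)
closed_form = 2 * math.e * math.log(n_regions)
```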
Proof. We use two potential functions $\Phi_\ell$ and $\Phi_m$. The potential function $\Phi_\ell$ amortizes the local cost within each region.

$$\Phi_\ell = \frac{r}{2(t+1)n} \sum_i \sum_j |\mathrm{OPT}_i - \mathrm{OPT}_j|^{t+1}$$

The other potential, $\Phi_m$, amortizes the movement cost between regions.

$$\Phi_m = \frac{1}{2n} \sum_i \sum_j |\mathrm{OPT}_i - \mathrm{OPT}_j|^{t}$$

The potential $\Phi$ for the strategy is simply $\Phi_\ell + \Phi_m$.

Justified by Theorem 4.1 and Assumption 4.1, we assume that, for a request $(k, \delta)$, $\mathrm{OPT}_k$ increases from some value $y$ to $y + \delta$. In this analysis the strategy competes against the average OPT value, $\frac{1}{n}\sum_i \mathrm{OPT}_i$. So the off-line cost is $\frac{\delta}{n}$.

Let $p_k$ and $p'_k$ represent the probability in region $k$ before and after the task vector, and let $\Phi_\ell$ ($\Phi_m$) and $\Phi'_\ell$ ($\Phi'_m$) represent the local (movement) potential before and after the task vector. Then the on-line strategy's cost is

$$r p'_k \delta + (p_k - p'_k) + \Phi'_\ell + \Phi'_m - \Phi_\ell - \Phi_m.$$

Because $p_k$ decreases as a function of $\mathrm{OPT}_k$, we can upper-bound this cost using an integral.

$$\int_y^{y+\delta} \left( p_k r + \frac{\partial \Phi_\ell}{\partial \mathrm{OPT}_k} - \frac{\partial p_k}{\partial \mathrm{OPT}_k} + \frac{\partial \Phi_m}{\partial \mathrm{OPT}_k} \right) d\mathrm{OPT}_k \tag{4.3}$$

We examine the first two terms, representing the local cost, and the last two terms, representing the movement cost, separately. In particular, we show that the amortized local cost is at most $r/n$, while the amortized movement cost is at most $2n^{1/t}t/n$.

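For the local cost, notice that, for any $j$,

$$\frac{\partial \Phi_\ell}{\partial \mathrm{OPT}_j} = \frac{r}{n} \sum_i (\mathrm{OPT}_j - \mathrm{OPT}_i)^{[t]} = -r\left(p_j - \frac{1}{n}\right).$$

Thus the local cost terms are equal to $r/n$.

$$p_k r + \frac{\partial \Phi_\ell}{\partial \mathrm{OPT}_k} = \frac{r}{n} \tag{4.4}$$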


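Analyzing the movement cost requires more work.

$$\begin{aligned}
-\frac{\partial p_k}{\partial \mathrm{OPT}_k} + \frac{\partial \Phi_m}{\partial \mathrm{OPT}_k}
&= \frac{t}{n} \sum_i |\mathrm{OPT}_i - \mathrm{OPT}_k|^{t-1} + \frac{t}{n} \sum_i (\mathrm{OPT}_k - \mathrm{OPT}_i)^{[t-1]} \\
&= \frac{2t}{n} \sum_{\mathrm{OPT}_i < \mathrm{OPT}_k} (\mathrm{OPT}_k - \mathrm{OPT}_i)^{t-1} \qquad (4.5)
\end{aligned}$$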

The last step follows from equation (4.1). We would like to simplify the summation. Say that $\mathrm{OPT}_a$ is currently the maximum OPT value. Observe using the probability allocation (4.2) that, since $p_a$ is not negative, the following holds.

$$\sum_{i \ne a} (\mathrm{OPT}_a - \mathrm{OPT}_i)^{t} = \sum_{i \ne a} (\mathrm{OPT}_a - \mathrm{OPT}_i)^{[t]} \le 1 \tag{4.6}$$

Because $\mathrm{OPT}_a$ is maximum, each term of the summation is nonnegative. Thus it follows from Lemma 4.4 that

$$\sum_{i \ne a} (\mathrm{OPT}_a - \mathrm{OPT}_i)^{t-1} \le (n - 1)^{1/t} < n^{1/t}.$$

Using the definition of $a$ again we continue from equation (4.5) to finish bounding the movement cost.

$$\frac{2t}{n} \sum_{\mathrm{OPT}_i < \mathrm{OPT}_k} (\mathrm{OPT}_k - \mathrm{OPT}_i)^{t-1} \le \frac{2t}{n} \sum_{i \ne a} (\mathrm{OPT}_a - \mathrm{OPT}_i)^{t-1} \le \frac{2t\,n^{1/t}}{n} \tag{4.7}$$


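The estimates of the local cost (4.4) and movement cost (4.7) bound the total cost (4.3) by

$$\int_y^{y+\delta} \left( p_k r + \frac{\partial \Phi_\ell}{\partial \mathrm{OPT}_k} - \frac{\partial p_k}{\partial \mathrm{OPT}_k} + \frac{\partial \Phi_m}{\partial \mathrm{OPT}_k} \right) d\mathrm{OPT}_k \le \int_y^{y+\delta} \frac{1}{n}\left(r + 2n^{1/t}t\right) d\mathrm{OPT}_k = \frac{\delta}{n}\left(r + 2n^{1/t}t\right).$$

The off-line cost (according to $\frac{1}{n}\sum_i \mathrm{OPT}_i$) is $\frac{\delta}{n}$, so the amortized competitive ratio is $r + 2n^{1/t}t$ as desired. ■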


To apply Odd-Exponent recursively on a k-HST, we must also bound the additive part for its $r$-unfair ratio. We see when we do this that we may want to choose a large value for $t$ since it reduces the maximum potential.

Lemma 4.6  The additive part to the ratio in Theorem 4.5 is bounded by $\frac{r}{t+1} + 2$.

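Proof. The additive part is the maximum change in potential from the beginning, plus 1 because the proof of Theorem 4.5 competes relative to $\frac{1}{n}\sum_i \mathrm{OPT}_i$, which may be as much as 1 away from $\min_i \mathrm{OPT}_i$. First, we bound $\Phi_\ell$. Let $a$ be the index of the maximum OPT value.

$$\begin{aligned}
\Phi_\ell &= \frac{r}{2(t+1)n} \sum_i \sum_j |\mathrm{OPT}_i - \mathrm{OPT}_j|^{t+1} \\
&= \frac{r}{(t+1)n} \sum_i \sum_{\mathrm{OPT}_j < \mathrm{OPT}_i} (\mathrm{OPT}_i - \mathrm{OPT}_j)^{t+1} \\
&\le \frac{r}{(t+1)n} \sum_i \sum_{\mathrm{OPT}_j < \mathrm{OPT}_i} (\mathrm{OPT}_a - \mathrm{OPT}_j)^{t+1} \\
&\le \frac{r}{t+1} \sum_j (\mathrm{OPT}_a - \mathrm{OPT}_j)^{t+1} \\
&\le \frac{r}{t+1} \sum_j (\mathrm{OPT}_a - \mathrm{OPT}_j)^{t} \qquad (4.8) \\
&\le \frac{r}{t+1} \qquad (4.9)
\end{aligned}$$

Inequality (4.8) follows because, since $\mathrm{OPT}_a \le \mathrm{OPT}_j + 1$, each term of the summation is at most one, so reducing the term's exponent increases the term's value. Inequality (4.9) comes from equation (4.6).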
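Bounding $\Phi_m$ is similar. Again, let $a$ be the index of the maximum OPT value.

$$\begin{aligned}
\Phi_m &= \frac{1}{2n} \sum_i \sum_j |\mathrm{OPT}_i - \mathrm{OPT}_j|^{t} \\
&= \frac{1}{n} \sum_i \sum_{\mathrm{OPT}_j < \mathrm{OPT}_i} (\mathrm{OPT}_i - \mathrm{OPT}_j)^{t} \\
&\le \frac{1}{n} \sum_i \sum_{\mathrm{OPT}_j < \mathrm{OPT}_i} (\mathrm{OPT}_a - \mathrm{OPT}_j)^{t} \\
&\le \sum_j (\mathrm{OPT}_a - \mathrm{OPT}_j)^{t} \\
&\le 1
\end{aligned}$$

Adding this to the bound for $\Phi_\ell$ in equation (4.9) gives the total bound on the potential. To bound the additive part, we add 1 more because $\frac{1}{n}\sum_i \mathrm{OPT}_i$ may differ from $\min_i \mathrm{OPT}_i$ by as much as 1. ■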


4.3  Two-Region 


Currently we have a technique (actually, two) for guaranteeing a $\mathrm{poly}(L, \log n)$ ratio for HSTs, where $L$ is the depth of the tree. It would be nice if we could guarantee that $L = \mathrm{polylog}(n)$, and indeed for many restricted sets of metric spaces we can; but such a guarantee for general metric spaces is impossible to make. For example, if we lay points at $1, 2, 4, \ldots, 2^{n-1}$ on an axis, the resulting h-HST must have depth $\Omega(\log_h 2^n)$. Thus, although we have made solid progress toward the $\mathrm{polylog}(n)$ ratio, we need new ideas to achieve it. These appear in the remainder of this chapter.

The  main  remaining  idea  is  more  of  a  convoluted  hack  than  an  elegant,  final  answer.  The  idea  is  simple: 
A  tree  with  more  than  polylog  (n)  levels  must  have  nodes  whose  subtrees  are  very  unbalanced  —  one  subtree 
has  many  more  leaves  than  any  of  the  others.  Or,  since  competitive  ratios  are  strongly  tied  to  the  tree  size, 
we  can  reword  it  in  the  jargon  of  unfairness:  We  want  to  handle  the  case  where  the  cost  ratios  are  different 
for  different  points  in  the  space.  (We  have  until  now  always  assumed  they  are  equal  for  all  points.) 

Having  different  cost  ratios  for  different  points  appears  to  be  a  complex  issue.  But  there  is  one  par¬ 
ticularly  simple  case  that  we  can  tackle:  the  case  of  having  only  two  points  with  separate  cost  ratios.  We 
can  utilize  this  in  building  a  strategy  for  the  HST:  Where  the  subtrees  are  all  roughly  the  same  size,  we  can 
still  use  Odd-Exponent  profitably;  but  where  one  is  much  larger,  we  can  combine  all  but  the  largest  using 
Odd-Exponent  and  then  apply  Two-Region  to  combine  this  combination  with  the  largest  subtree. 

We first look at the unusual two-point unfair scenario and present Two-Region as our algorithm. In this problem, one point has unfairness $r_1$ while the other has unfairness $r_2$. Blum et al. consider this scenario, but their analysis does not have to worry about the additive constant [BKRS92]. Seiden [Sei99] independently develops the same algorithm.

Algorithm Two-Region ([BKRS92, Sei99])  Without loss of generality, say $r_1 > r_2$. Let $\beta$ represent $e^{r_1 - r_2}$, and define $p_1(y)$ as follows.

$$p_1(y) = \frac{\beta - \beta^{(y+1)/2}}{\beta - 1} \tag{4.10}$$

After computing the work function OPT, we place $p_1(\mathrm{OPT}_1 - \mathrm{OPT}_2)$ probability in the first region and the rest in the second.
While the strategy is hardly intuitive, the analysis will make the reason for the selection clear.

Theorem 4.7  The competitive ratio of Two-Region is

$$r_1 + \frac{r_1 - r_2}{e^{r_1 - r_2} - 1}.$$

The additive part is at most $r_2 + 2$.

Proof. Because $p_1(1) = 0$ and $p_1(-1) = 1$, this algorithm does not have the problem of allocating positive probability to a pinned state.

What we will show is that for a given potential $\Phi$, for any task $t$, the cost to Two-Region is bounded by

$$\mathit{cost}_{\text{Two-Region}} + (\Phi^t - \Phi^{t-1}) \le \left( r_1 + \frac{r_1 - r_2}{e^{r_1 - r_2} - 1} \right) (\mathrm{OPT}_1^t - \mathrm{OPT}_1^{t-1}).$$

This means that to achieve the ratio, the potential must entirely absorb the cost any time the second state incurs some cost. We define the potential function, therefore, as

$$\Phi(y) = (1 - p_1(y)) + r_2 \int_{-1}^{y} (1 - p_1(u))\, du,$$

and the potential as $\Phi(\mathrm{OPT}_1 - \mathrm{OPT}_2)$. This potential completely absorbs all increases to $\mathrm{OPT}_2$.

Let us consider a request that increases $\mathrm{OPT}_1$ from $z$ to $z + \delta$. The strategy's amortized cost for this request is at most

$$\int_z^{z+\delta} \left( p_1(y) r_1 - 2\frac{dp_1}{dy} + (1 - p_1(y)) r_2 \right) dy.$$

(The integral approximates the cost because $p_1$ is a decreasing function.) By setting the integrand to a constant we obtain a first-order differential equation in $p_1$, which can be solved with the boundary conditions $p_1(1) = 0$ and $p_1(-1) = 1$. The solution is as in equation (4.10). It is easy to verify that this results in a constant integrand.

$$\int_z^{z+\delta} \left( p_1(y) r_1 - 2\frac{dp_1}{dy} + (1 - p_1(y)) r_2 \right) dy = \left( r_1 + \frac{r_1 - r_2}{e^{r_1 - r_2} - 1} \right) \delta$$

Since the off-line player pays $\delta$, the competitive ratio for the strategy is as advertised.
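The constant-integrand claim can be checked numerically. The sketch below assumes the closed form $p_1(y) = (\beta - \beta^{(y+1)/2})/(\beta - 1)$ with $\beta = e^{r_1 - r_2}$, which is what the differential equation and the boundary conditions $p_1(1) = 0$, $p_1(-1) = 1$ yield; the function names and the sample values of $r_1$ and $r_2$ are hypothetical.

```python
import math


def p1(y: float, r1: float, r2: float) -> float:
    """Probability on the first region for y = OPT_1 - OPT_2 (needs r1 > r2)."""
    beta = math.exp(r1 - r2)
    return (beta - beta ** ((y + 1) / 2)) / (beta - 1)


def dp1(y: float, r1: float, r2: float) -> float:
    """Derivative of p1 with respect to y (always negative)."""
    beta = math.exp(r1 - r2)
    return -0.5 * (r1 - r2) * beta ** ((y + 1) / 2) / (beta - 1)


def integrand(y: float, r1: float, r2: float) -> float:
    """Amortized cost rate: p1*r1 - 2*p1' + (1 - p1)*r2."""
    p = p1(y, r1, r2)
    return p * r1 - 2 * dp1(y, r1, r2) + (1 - p) * r2


def two_region_ratio(r1: float, r2: float) -> float:
    """The ratio claimed in Theorem 4.7."""
    return r1 + (r1 - r2) / (math.exp(r1 - r2) - 1)
```

For any $y$ in $[-1, 1]$ the integrand evaluates to the same constant, which is the competitive ratio of Theorem 4.7.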

To bound the additive part, we note how widely the potential can vary. Because always $y \ge -1$, the potential is always nonnegative. The potential is largest when $y = 1$. In this case the first term is 1 and (using some straightforward calculus) the second term is at most $r_2$. Thus the potential is at most $r_2 + 1$. Since $\mathrm{OPT}_1$ differs from the optimal cost by at most 1, the additive part is at most $r_2 + 2$. ■
4.4  Building the polylog(n) algorithm


As  in  Theorem  2.7,  we  build  our  algorithm  for  the  HST  inductively.  In  building  the  algorithm,  we  modify 
the  HST  so  that  the  distance  between  any  two  points  does  not  decrease  but  may  increase  to  twice  the  initial 
distance.  This  costs  us  only  a  factor  of  2  in  the  overall  ratio. 

Theorem 4.8  For an h-HST with $h \ge 8000 \ln^2 n$, we can modify the HST so that distances at most double and for the modification we have an on-line MTS algorithm with a competitive ratio of at most $1000 \ln^2 n$ with an additive $2000 D \ln^2 n$, where $D$ is the diameter of the modified tree.

Remark.  The  following  proof  draws  heavily  on  the  technical  details  already  discussed  in  Theo¬ 
rem  2.7.  Understanding  that  proof  is  essential  to  understanding  the  following. 

To  avoid  complications,  this  theorem  employs  intentionally  generous  constants. 

Proof. We prove this inductively on the tree, with the base case being the trivial single-node tree. For the induction step, we let $r_i$ be the ratio for subtree $i$, with the subtrees ordered so $r_1 \ge \cdots \ge r_b$. We define $n_i$ as the number of points in subtree $i$. Finally, $n$ is the total number of points in all subtrees.

The induction step has two cases, depending on whether $r_1$ is below $1000 \ln^2 n - 50 \ln n$ (in which case the subtrees are balanced enough to simply apply Odd-Exponent) or above (in which case the subtrees are very unbalanced).


Case 1. If $r_1 \le 1000 \ln^2 n - 50 \ln n$, then our strategy is to first mutate the tree by doubling the distances from the root node to the points. We apply Odd-Exponent to combine the subtrees using $t = \ln n$.

To bound the competitive ratio, we observe that for an arbitrary action sequence $v$, implying an action sequence $v_B$ for moving between subtrees, the off-line cost is at least


local (T a  i  Vfc)  + 


morc(v[})  =  local (Ta*  v^)  +  QT)  move  [vfj  , 


where  we  define  Q  as  1  —  jr.  Meanwhile,  given  the  action  sequence  V4  used  by  the  on-line  algorithm, 
the  expected  cost  is  at  most 

(1000  In2  ??  —  50  In  ??)  local ( T4,  V4)  +  +  2000“  In2  77^  mocc(v.\)  +  2000— In2  77 


5D 


D 


<  (1000  In2  ?7  -  50  In  77)  local (T a,  Y4)  +  —^move (V4)  +  —  . 

We  can  use  Odd-Exponent's  unfair  competitive  ratio  to  bound  the  on-line  cost  in  expectation  over 
Odd-Exponent’s  random  choices. 


E 


(1000  In2  7?  —  50 111 7?)  local (Ta,  Y4)  +  move  ( V4 )  +  ^ 


E 


(800C In2  ?7  —  40C hi  ??)  local  ( T4.  V4  ]  +  i))orc(v.\ ) 


< 


5  D 
4 

5  D 
4 


800C  In2  n  —  40C  In  v  \  D 

+  +2)  +  t 


C  D 


+ 


I) 


(800C In2  n  -  40C In  11  +  5.5  In  n)  floral  ^ .  V/,) 


<  (1000  In2  n)  (local (T a-  Vb)  +  C D  tnorr(vb))  +  1000(,'D  In  v  +  — 

Thus  we  have  satisfied  our  inductive  hypothesis. 


+  movc(vb) 


D 
Figure 4.1: Transformation for Case 2 of Theorem 4.8.


Case 2. If $r_1 > 1000 \ln^2 n - 50 \ln n$, then our strategy is to first mutate the tree by splitting the root node into two nodes $y$ and $z$, where subtrees 2 through $b$ are subtrees of $z$, while the subtrees of $y$ are subtree 1 and the tree rooted at $z$. (Figure 4.1 illustrates this.) The distances in the tree rooted at $z$ remain the same, but $y$ is lifted so that its distance from the leaves doubles. At $z$, the ratio of the
largest  subtree’s  diameter  to  the  overall  tree’s  diameter  is  at  most  |  (since  we  doubled  the  diameter 
of  the  subtrees  in  the  inductive  step);  at  y,  this  ratio  is  2. 

Our algorithm is to use Odd-Exponent to combine the subtrees of $z$ (choosing $t = 2 \ln n$), and to use Two-Region to combine the subtrees of $y$. To analyze the competitive ratio, we first analyze the tree rooted at $z$ and then the tree rooted at $y$. We assign $x$ so that $n_1 = (1 - \frac{1}{x})n$; since $r_1 > 1000 \ln^2 n - 50 \ln n$ and $r_1 \le 1000 \ln^2 n_1$, we can deduce that $40 \le x \le n$.

For the tree rooted at $z$, we observe that for an arbitrary action sequence $v$, implying an action sequence $v_B$ for moving between subtrees, the off-line cost is at least


local (Ta,  Vft)  +  1  D  — 


2D 


moveiyi)  —  local  (Ta.  vj)  +  QD  move(vb) , 


where  we  define  (  as  1  -  Meanwhile,  given  the  action  sequence  V4  used  by  the  on-line  algorithm, 
the  expected  cost  is  at  most 

1000 1112  — )  local(T\,  V4)  +  f  D  -I-  2000 In2  \  move(vA)  +  2000^^  In2  — 

V  x>  \  h  x )  '  h  x 

<  ^1000 In2  ^  local (Ta,va)  +  ^ move(vA )  +  y  • 

We  can  use  Odd-Exponent’s  unfair  competitive  ratio  to  bound  the  on-line  cost  in  expectation  over 
Odd-Exponent’s  random  choices. 


E 


(lOOO In2  ^  /oc«/(Ta,  V4)  +  ^-move(vA)  +  y 
3  D 

K  I  I  - -  In*  —  I  local  I 

C  D 


D 


/2000C  ln2  (A  local  (-^j-Ta,  va  )  +  move {v A) 


D 
+  7 


<  ^ in2  71  +  6  6 in  ^ local  +  move(vb +  - 


2000C In 2 ^ 


<  ^1000 In2  —  +  -y In n'j  (local(TA, v;,)  +  £Dm.ove(vb))  +  500C-Dlnn  +  ^-D 

Thus  the  competitive  ratio  r~  for  the  tree  rooted  at  z  is  at  most  1000  In2  ~  ~  In  n ,  with  an  additive 

part  of  at  most  500C-D  In  n  +  |£>. 
For an arbitrary action sequence $v$, implying an action sequence $v_B$ moving between the two subtrees of $y$, the off-line cost is at least

local (T a.  Vfc)  +  ~morc(vb)  . 

For  the  on-line  algorithm,  given  that  it  uses  the  action  sequence  v 4,  the  expected  cost  is  at  most 


r  1  local i(T a a)  +  r-Zorr/^ (T4,  V4)  +  ^ D  +  -  ^500CDln  v  +  ^ D  +  500—  In2  77^  move(vd 4) 
+  500£D  In  77 


-D  +  500“  In2  /? 

2  /7 


(“ 


<  r  1  local  1  (T4.  V4)  +  r:: local) ( T| .  V4 )  +  (  250 /Jin  »  +  —77^  nioc(  {v  \ )  +  500 7)  In  rt  +  ~-D  . 


Here  local  \  (T4,  V4)  represents  the  total  cost  incurred  at  point  1  with  the  task  sequence  T4  using 
the  action  sequence  V4.  (The  peculiar  movement  cost  comes  from  the  fact  that  half  of  the  move¬ 
ments  involve  the  additive  cost  of  500  j-  In2  »  and  half  involve  the  additive  cost  500(77  In  n  +  -jD.) 
We  can  find  the  expectation  over  Two-Region's  selection  of  V4  by  using  the  competitive  ratio  of 
Two-Region.  In  the  following,  we  let  a  represent  250  In  v  + 


E 


r  1  local  1  (T.4,  V4)  +  r~local2{TA.  V4)  +  aDmovc  (V4)  +  500Dln  »  +  -77 


=  a77E 


n_ 

2o 
77


r~ 


local  1  (  — T4.V4  )  +  —-local)  — T4.V4  +  morc(\ 4) 


2o 


2 


77 


+  50077  In  n  +  -77 


5  oD  ((£ + #iTr)  (Wi  (lT-''vj) + (bTj  V4) + ,,k"’,(vj))  +  (2fe + 1 


+  50077  In  «  +  -77 


-  r1  + 


n  -  /*, 


V  €  -  1  ) 

+  (r~D  +  oD)  +  500Dln  77  +  2D 


local i  (T4,  V4)  +  local-2 ( T4,  V4)  +  y/??orr(v4) 


Thus our computed competitive ratio is $r_1 + (r_1 - r_z)/(e^{(r_1 - r_z)/2\alpha} - 1)$. We want to bound this by $1000 \ln^2 n$. To do this, we first bound $r_1 - r_z$ from below (since $x/(e^{x/2\alpha} - 1)$ decreases as $x$ increases beyond $2\alpha$).


7'x  —  r~  >  (1000  In2  n  —  50  In  ??)  -  (  1000  1 


,  9  77  10,  \ 

In2  — | — —  In  77  ) 

;7-  (,  ) 


—  2000  In  x  In  7?  -  1000 1112  x  -  50  In  77  -  In  7? 
>  900  In  x  In  7? 


We  use  this  to  bound  the  ratio. 
Likewise, we can bound the additive part

+  250  In  n  +  ^  +  500  In  n  +  ^  D  , 

which is less than $2000 D \ln^2 n$.


4.5  Extensions 

We  can  extend  Theorem  4.8  in  two  ways:  We  can  try  using  Share  instead  of  Odd-Exponent,  and  we  can 
look  at  what  happens  in  specific  metric  spaces. 

An  alternative  algorithm 

Theorem 4.8 did not rely on any specific properties of Odd-Exponent. But if we were to apply Share instead, the ratio would suffer due to the additive part: Share has an additive $r$, while Odd-Exponent has an additive $r/\log n$. The $r/\log n$ additive part is necessary in Case 2 to get a manageable additive part for the tree rooted at $z$.

When  we  adapted  Share  to  MTS  (Theorem  3.5),  we  had  a  parameter  7,  and  in  fact  the  additive  part  was 
Taking  7  =  log  n,  the  unfair  competitive  ratio  is  r  +  log2  n.  So  if  we  use  Share  in  proving  an  O  (log3  n) 
bound on an $\Omega(\log^3 n)$-HST, we can get a working theorem. The net result is an $O(\log^7 n \log\log n)$ bound
for general metric spaces.

Alternative  spaces 

For many restricted sets of metric spaces, the bound improves by using Odd-Exponent (or Share, with a penalty due to the additive part) on Theorem 2.7. For example, if the metric between states comes from the shortest-path metric on an unweighted graph on states, we know the depth of the HST must be $O(\log_h n)$, so we can get a ratio of $O(\log^2 n / \log\log n)$ on an $\Omega(\log n)$-HST. Since we can probabilistically $O(\log^2 n \log\log n)$-approximate unweighted graphs by $\Omega(\log n)$-HSTs, we get a result of $O(\log^4 n)$ for unweighted graphs.

We can do even better for HSTs that are “roughly balanced” in the sense that, at any node with $b$ subtrees covering a total of $n$ nodes, the largest subtree contains at most $1.4\frac{n}{b}$ nodes. In this case, we can use the inductive hypothesis that the ratio is $4e \ln n$: The largest subtree has a ratio $r$ of at most $4e \ln(1.4\frac{n}{b})$, and so Odd-Exponent combining the $b$ subtrees has ratio $r + 2e \ln b \le 4e \ln n$. The additive part is $O(\log n)$, so we require an $\Omega(\log n)$-HST for this to work.
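The inductive step for roughly balanced trees reduces to one inequality. Assuming the largest subtree holds at most $1.4\,n/b$ nodes (our reading of the balance condition), the step $4e \ln(1.4\,n/b) + 2e \ln b \le 4e \ln n$ holds exactly when $\ln b \ge 2 \ln 1.4$, that is, for every $b \ge 2$. A minimal Python check, with a hypothetical function name:

```python
import math


def induction_step_holds(n: int, b: int) -> bool:
    """Check 4e*ln(1.4*n/b) + 2e*ln(b) <= 4e*ln(n) for b subtrees on n nodes."""
    e = math.e
    largest_subtree_ratio = 4 * e * math.log(1.4 * n / b)
    combined = largest_subtree_ratio + 2 * e * math.log(b)
    return combined <= 4 * e * math.log(n) + 1e-12
```

The slack in the inequality is $2e(\ln b - 2\ln 1.4)$, so it is tightest at $b = 2$ and grows with the branching factor.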

This “roughly balanced” property arises in mesh spaces, like a line space: We can $O(\log^2 n / \log\log n)$-approximate such a space with roughly balanced $\Omega(\log n)$-HSTs [Bar96]. Thus for mesh spaces, we get a ratio of $O(\log^3 n / \log\log n)$.
Chapter  5

Combining  on-line  algorithms 


Now we switch away from the general-metric MTS problem; instead we pick up on the theme of Section 3.4 and extend the application of Experts algorithms to competitive analysis. In each of Chapters 5, 6, and 7, we extend the result in a different way; in this chapter, we examine applications to the problem of combining on-line algorithms on-line.

Problem Combine-Online  Given are a variety of on-line algorithms $A_1, A_2, \ldots, A_n$, each incurring losses during each time step. At all times, our on-line algorithm chooses to follow one of these algorithms, incurring that algorithm's losses, but between time steps the on-line algorithm may choose to switch between algorithms at a switching cost of $d$. Our hope is that on any sequence the algorithm will not do too much worse than the best of the $A_i$ for that sequence.

Example 5.1  Say we have a variety of paging algorithms like LRU, Marking, and MRU. On any sequence of page requests, we want to do about as well as the best among them in hindsight. One way of doing this is to follow algorithms' internal caches, allowing the on-line algorithm to switch between caches. The cost to switch between caches is at most the size of the cache (usually represented by $k$).

This problem is similar to one arising in Azar, Broder, and Manasse, with the difference that they do not incorporate a fixed switching cost $d$ [ABM93]. Instead, in their problem, when the Combine-Online algorithm switches from one algorithm $A_i$ to another $A_j$, the algorithm may pay as much as the total cost paid so far by $A_i$ and by $A_j$. This is because they are primarily concerned with combining algorithms for the $k$-server problem, where the algorithms are moving within an unbounded metric space, and so in moving between algorithms the Combine-Online algorithm may have to move all the way back to the initial point of the space (which is at most the total cost paid by $A_i$) and then to the point currently occupied by $A_j$. The guarantee they achieve for this more difficult scenario is that their algorithm pays no more than $O(\log n)$ times the best of the on-line algorithms it is combining. (If the metric space has a bounded diameter $D$, then we could alternatively apply the results of this chapter to the problem using $d = D$.)

What we will see is that in our formulation, an algorithm (using Experts algorithms) can do nearly as well as the best single algorithm. In particular, if the best algorithm $A_k$ incurs a total cost of $L$, then our combination algorithm will pay at most $(1 + \varepsilon)L + (1 + \frac{1}{\varepsilon})d \log n$.
Example 5.2  Another application is to the List-Update problem [BM85, ST85a]. In this problem, our algorithm maintains a list $L$ over $n$ elements. Each time step, the algorithm receives a request to one of the $n$ elements and pays 1 for each step that must be made in the list to find the element. The algorithm also pays 1 each time it transposes two adjacent elements in the list, unless that transpose moves the element of the current access forward in the list.

List-Update is a classical problem in competitive analysis. Move-To-Front is one of the simplest algorithms: On each time step, Move-To-Front moves the just-accessed item to the head of the list. This algorithm has a competitive ratio of $2 - \frac{2}{n+1}$ [BM85, ST85a, Ira91]. Karp and Raghavan show that no deterministic algorithm can guarantee less, even against a static adversary (who is not allowed to alter the list ordering) (reported in [Ira91]).

The situation for randomized algorithms is less certain. The best-known algorithm is Comb, with a competitive ratio of 1.6 against dynamic adversaries [AvSW95]. No algorithm can achieve a ratio of less than $1.5 - \frac{5}{n+5}$ [Tei93]. For static adversaries, no lower bound for randomized algorithms is known, nor is there a better bound than that for Comb.

By demonstrating a (massively inefficient) algorithm, the results of this chapter imply that no such lower bound is possible for static adversaries. We can have an algorithm for each of the $n!$ possible lists; the algorithm for list $L$ statically keeps $L$ as its list. The switching cost between algorithms is at most $\binom{n}{2}$. From this, we get a ratio of $(1 + \varepsilon)$ for any fixed $\varepsilon$. (This algorithm, as stated, is extraordinarily impractical. We are taking advantage of the fact that the on-line model does not count the time spent deciding which item to move. A simpler and more efficient algorithm achieving a similar guarantee would be an interesting result.)

This observation extends naturally to the Dynamic-Tree problem, where the on-line algorithm is permitted to rearrange a binary search tree by rotations along the path to the accessed node. Sleator and Tarjan demonstrate that their Splay-Tree algorithm is $O(1)$-competitive against a static adversary [ST85b]. But by having a separate algorithm for each possible tree, the results of this chapter demonstrate that one can in fact be $(1 + \varepsilon)$-competitive if we do not count the time spent deciding how to rearrange the tree. (For Dynamic-Tree against dynamic adversaries, the lowest known bound is $O(\log n)$; for example, a static balanced tree achieves the bound [Ig ■»].)


5.1  Simulating  all  algorithms 

If our on-line algorithm can afford to simulate all $n$ algorithms, then we can apply Hedge in the straightforward way: We follow the probabilities that Hedge uses. When we get an event and see how the different algorithms will process it, we give those same losses to the algorithms' corresponding experts for Hedge. We then change our probability distribution according to Hedge, moving to a new algorithm as Hedge directs.
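The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: losses lie in $[0, 1]$, Hedge multiplies each weight by $\beta^{\text{loss}}$, and the expected switching cost per step is $d$ times the probability mass that must leave some algorithm. The function name is hypothetical.

```python
import math
import random


def hedge_combine(losses, beta, d):
    """Follow Hedge's distribution over n simulated algorithms.

    losses[t][i] is algorithm i's loss in [0, 1] at step t.
    Returns (expected local cost, expected switching cost).
    """
    n = len(losses[0])
    weights = [1.0] * n
    local = 0.0
    move = 0.0
    for step in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of the algorithm we are currently following.
        local += sum(p * loss for p, loss in zip(probs, step))
        # Hedge update: each weight shrinks by beta^loss.
        weights = [w * beta ** loss for w, loss in zip(weights, step)]
        new_total = sum(weights)
        new_probs = [w / new_total for w in weights]
        # Mass leaving algorithms whose probability dropped pays d per unit.
        move += d * sum(max(p - q, 0.0) for p, q in zip(probs, new_probs))
    return local, move
```

On any loss sequence the switching cost returned stays within $d \ln(1/\beta)$ times the local cost, because each probability can drop by at most a factor $\beta^{\text{loss}}$ in one step.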

Theorem 5.1  Say that the best algorithm has a total loss of $L$. Then the loss of Hedge is at most

$$E[\mathit{loss}_{\text{Hedge}}] \le \frac{1 + d \ln(1/\beta)}{1 - \beta} \left( \ln(1/\beta)\, L + \ln n \right).$$

Remark. Say we choose $\beta = 1 - \frac{\varepsilon}{d}$ for some $\varepsilon > 0$. Then the above bound translates to approximately $(1 + \varepsilon)L + (1 + \frac{1}{\varepsilon})d \ln n$. That is, we are $(1 + \varepsilon)$-competitive with respect to the best on-line algorithm.
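To see how close the approximation in the remark is, the following Python sketch evaluates both expressions. It assumes the bound has the form $\frac{1 + d\ln(1/\beta)}{1-\beta}(\ln(1/\beta)L + \ln n)$ with $\beta = 1 - \varepsilon/d$; that form is our reading of the theorem, and the function name and sample values are hypothetical.

```python
import math


def hedge_switching_bound(L, n, d, eps):
    """Return (bound evaluated at beta = 1 - eps/d, approximation from the remark)."""
    beta = 1 - eps / d
    ln_inv_beta = math.log(1 / beta)
    exact = (1 + d * ln_inv_beta) / (1 - beta) * (ln_inv_beta * L + math.log(n))
    approx = (1 + eps) * L + (1 + 1 / eps) * d * math.log(n)
    return exact, approx


# Hypothetical instance: 8 algorithms, switching cost 100, best loss 10000.
exact_bound, approx_bound = hedge_switching_bound(L=10000, n=8, d=100, eps=0.1)
```

The two values agree to within a fraction of a percent whenever $\varepsilon/d$ is small, since $\ln(1/\beta) \approx \varepsilon/d$ in that regime.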

Proof. Say algorithm $A_k$ is the best algorithm. We consider separately the local cost (that is, the amount spent by our on-line algorithm due to the algorithm it currently occupies) and the switching cost incurred by our on-line algorithm. Theorem 3.2 shows that the local cost satisfies

$$E[\mathit{loss}_{\text{Hedge}}] \le \frac{\ln(1/\beta)\, L + \ln n}{1 - \beta}.$$

(In fact, we get one-step lookahead; this only decreases the algorithm's cost.)

We now show that the movement cost is at most $d \ln(1/\beta)$ times the local cost. At step $t$, we expect to pay $d \sum_{i : p_i^{t-1} > p_i^t} (p_i^{t-1} - p_i^t)$ for movement, where $p_i^t$ is the probability the on-line algorithm is following algorithm $i$ at time step $t$.


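$$\begin{aligned}
d \sum_{i : p_i^{t-1} > p_i^t} (p_i^{t-1} - p_i^t)
&= d \sum_{i : p_i^{t-1} > p_i^t} \left( \frac{w_i^{t-1}}{W^{t-1}} - \frac{w_i^t}{W^t} \right) \\
&\le d \sum_{i : p_i^{t-1} > p_i^t} \left( \frac{w_i^{t-1}}{W^{t-1}} - \frac{w_i^t}{W^{t-1}} \right) \\
&= d \sum_{i : p_i^{t-1} > p_i^t} \frac{w_i^{t-1}}{W^{t-1}} \left( 1 - \beta^{\ell_i^t} \right) \\
&\le d \ln(1/\beta) \sum_i p_i^{t-1} \ell_i^t
\end{aligned}$$

Here $w_i^t$ is the weight Hedge assigns to algorithm $i$ after step $t$, $W^t = \sum_i w_i^t$, and $\ell_i^t$ is the loss of algorithm $i$ at step $t$. The first inequality holds because the total weight never increases, and the second because $1 - \beta^x \le x \ln(1/\beta)$. Since $\sum_i p_i^{t-1} \ell_i^t$ is the expected local cost, we have achieved our goal.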


5.2  Running  only  one  algorithm 

The  problem  becomes  more  intricate  when  we  can  run  only  one  of  the  n  algorithms  at  a  time.  Such  may  be 
the  case,  for  example,  if  we  are  combining  several  paging  algorithms  but  the  system  cannot  afford  the  time 
required  to  simulate  all  of  the  algorithms  in  order  to  maintain  their  losses. 

This is a version of the Bandits problem studied by Auer, Cesa-Bianchi, Freund, and Schapire [ACBFS95, ACBFS98]. Bandits is a variant of Experts, where each time step the algorithm can see the loss of only the expert chosen. (The problem's name derives from slot machines.) Auer et al. show that, by mixing the Hedge distribution appropriately with the uniform distribution, they can guarantee a loss of at most $O(\sqrt{Tn \log n})$ more than the best expert's loss, where $T$ is the number of time steps.

To mesh better with the phrasing of Auer et al., we consider the scenario where each time step every expert incurs a reward in $[0, 1]$, and we wish to maximize our gain. Our scenario adds to theirs the concept of a switching cost $d$, which works as follows: In time round $t$, expert $i$ has a true gain $x_i^t$ in $[0, 1]$, but the gain the algorithm actually sees is an approximation to this called the observed gain $\hat{x}_i^t$ (also in $[0, 1]$). The true gain and the observed gain are related in that, if the algorithm remains at a single expert $i$ from $t_0$ to $t_1$, then the total observed gain $\sum_{t=t_0}^{t_1} \hat{x}_i^t$ is at most $d$ less than the total actual gain $\sum_{t=t_0}^{t_1} x_i^t$. (This somewhat convoluted way of incorporating the switching cost comes from the paging case in Example 5.1. When we switch from one algorithm to another, we do not know the actual cost incurred by the new algorithm, since we have not kept track of where it is. Our model assumes that all the algorithms have the property that, regardless of the request sequence, the initial cache cannot affect the total cost by more than $d$.)




This  switching  cost  removes  the  luxury  (which  Auer  et  al.  enjoy)  of  choosing  an  expert  independently 
each  time  round,  because  switching  as  often  as  this  implies  is  quite  expensive.  One  possible  solution  to 
this  problem,  which  we  pursue,  is  to  divide  time  into  segments  of  s  steps.  (We  choose  s  later.)  We  choose 
independently  from  the  distribution  at  the  beginning  of  each  time  segment,  and  we  stay  there  for  the  duration 
of the segment. Behaving in this way is equivalent to running Auer et al.'s algorithm for $T/s$ time steps, where
in each step an expert's gain may be as large as $s$, rather than at most 1.

Algorithm Hedge-Bandit The algorithm has two parameters, $s$ and $\gamma$. For each time segment of
$s$ steps, the algorithm does the following.

1. We choose one expert $i^t$ for the time segment $t$ (time steps $ts$ through $(t+1)s$) based on the
probabilities

$$\hat{p}_j^{t-1} = (1-\gamma)\,p_j^{t-1} + \frac{\gamma}{n},$$

where $p^{t-1}$ is the probability distribution used by Hedge.

2. We observe the gain $\tilde{x}_{i^t}^t$ for the segment. (For $j \ne i^t$, we take $\tilde{x}_j^t$ to be 0.)

3. We let $\hat{x}_j^t = \tilde{x}_j^t / \hat{p}_j^{t-1}$, and give this vector $\hat{x}^t$ to Hedge in order to compute $p^t$ for the next
time segment.
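
The three steps can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: it assumes Hedge keeps weights proportional to $\beta$ raised to the cumulative scaled gain, with $\beta = e^{\gamma/sn}$, and the callback `observed_gain(i, t)` (a hypothetical name) returns the total observed gain, in $[0, s]$, of expert `i` over segment `t`.

```python
import math
import random

def hedge_bandit(observed_gain, n, num_segments, s, gamma, seed=0):
    """Sketch of Hedge-Bandit over num_segments segments of s steps each."""
    rng = random.Random(seed)
    log_beta = gamma / (s * n)      # assumption: beta = exp(gamma / (s * n))
    scaled = [0.0] * n              # cumulative scaled gains handed to Hedge
    total, choices = 0.0, []
    for t in range(num_segments):
        top = max(scaled)           # subtract the max for numerical stability
        w = [math.exp(log_beta * (g - top)) for g in scaled]
        z = sum(w)
        # Step 1: mix Hedge's distribution with the uniform distribution.
        p_hat = [(1 - gamma) * wj / z + gamma / n for wj in w]
        i = rng.choices(range(n), weights=p_hat)[0]
        # Step 2: only the chosen expert's gain is observed.
        gain = observed_gain(i, t)
        total += gain
        # Step 3: feed the scaled gain (zero for all other experts) to Hedge.
        scaled[i] += gain / p_hat[i]
        choices.append(i)
    return total, choices
```

Because every entry of `p_hat` is at least `gamma / n`, each scaled gain is at most `s * n / gamma`, which is the quantity the analysis below calls $M$.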


Analyzing this algorithm requires the following theorem of Auer et al., which generalizes the bound on
Hedge's performance (Theorem 3.2) to the case when an expert's gain may be as much as $M$ per time
step.

Theorem 5.2 If each of a set of $n$ experts experiences a sequence $x_j^t$ of gains in $[0, M]$, then Hedge
configured  with  ji  €  (0,1)  has  expected  gain  ]T  ■  p*-x*  of  at  least 
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for  all  experts  k . 

We  use  this  in  the  following  theorem  bounding  the  performance  of  Hedge-Bandit. 

Theorem 5.3 The expected gain of Hedge-Bandit is at least

$$G - (1-\gamma)\frac{T}{s}\,d - \frac{1-\gamma}{\gamma}\,sn\ln n - (e-1)\gamma G,$$

where $G$ is the largest total actual gain acquired by any single expert, and where $\beta = e^{\gamma/sn}$.

By choosing appropriate values for $\gamma$ and $s$ as described in the following corollary, we bound our gain
relative to the best of the algorithms.

Corollary 5.4 The expected gain of Hedge-Bandit is at least

$$G - 3.6\sqrt[3]{dnT^2\ln n},$$


where  G  is  the  largest  total  actual  gain  acquired  by  any  single  expert ,  if  we  choose  the  parameters 


7 
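
As a sanity check on the constant 3.6, the following worked calculation sketches one parameter choice that produces it. It assumes that the bound of Theorem 5.3 has the form $G - (1-\gamma)\frac{T}{s}d - \frac{1-\gamma}{\gamma}sn\ln n - (e-1)\gamma G$, uses $G \le T$, and drops the factors $(1-\gamma) \le 1$. The values of $\gamma$ and $s$ below are a reconstruction under those assumptions and are not necessarily the ones the corollary states.

```latex
\begin{align*}
G - E[\text{gain}] &\le \frac{Td}{s} + \frac{sn\ln n}{\gamma} + (e-1)\gamma T \\
\gamma = \sqrt{\frac{sn\ln n}{(e-1)T}}
  &\;\Longrightarrow\; G - E[\text{gain}] \le \frac{Td}{s} + 2\sqrt{(e-1)\,snT\ln n} \\
s = \left(\frac{d^2 T}{(e-1)\,n\ln n}\right)^{1/3}
  &\;\Longrightarrow\; G - E[\text{gain}] \le 3\,(e-1)^{1/3}\sqrt[3]{dnT^2\ln n}
  \approx 3.6\sqrt[3]{dnT^2\ln n}
\end{align*}
```

The choice of $s$ balances the switching term $Td/s$ against the bandit term, which grows as $\sqrt{s}$; at the balance point the switching term is exactly half of the bandit term.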


The proof of Theorem 5.3 closely follows the technique used by Auer et al. [ACBFS98].

Proof of Theorem 5.3. Let $k$ be the expert acquiring the largest total actual gain. Because
$\hat{p}_j^{t-1} \ge \gamma/n$
for any expert $j$ in any time segment $t$, the scaled observed gain $\hat{x}_j^t = \tilde{x}_j^t/\hat{p}_j^{t-1}$ is at most $sn/\gamma$. So we
take $M$ to be $sn/\gamma$ (and recall $\beta = e^{\gamma/sn}$) in applying Theorem 5.2 for the following bound.


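$$\begin{aligned}
\sum_{t=1}^{T/s} p^{t-1}\cdot\hat{x}^t
  &\ge \sum_{t=1}^{T/s}\hat{x}_k^t - \frac{\ln n}{\ln\beta}
     - \frac{\beta^M - 1 - M\ln\beta}{M^2\ln\beta}\sum_{t=1}^{T/s}\sum_{j=1}^{n} p_j^{t-1}\left(\hat{x}_j^t\right)^2 \\
  &= \sum_{t=1}^{T/s}\hat{x}_k^t - \frac{sn}{\gamma}\ln n
     - (e-2)\frac{\gamma}{sn}\sum_{t=1}^{T/s}\sum_{j=1}^{n} p_j^{t-1}\left(\hat{x}_j^t\right)^2
     \qquad (5.1)
\end{aligned}$$
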

Now, because $\hat{p}_j^{t-1} \ge (1-\gamma)\,p_j^{t-1}$, we can observe the following.

We use both of these facts, along with (5.1) and the relationship of the observed gains $\tilde{x}$ to the actual
gains $x$, to bound the total gain, $\sum_{t=1}^{T/s} x_{i^t}^t$.

To get the expected gain $E\left[\sum_t x_{i^t}^t\right]$, we first observe that $E[\hat{x}_j^t]$ equals $E[\tilde{x}_j^t]$:

$$E[\hat{x}_j^t] = E_{i^1,\ldots,i^{t-1}}\left[E_{i^t}\left[\hat{x}_j^t \mid i^1,\ldots,i^{t-1}\right]\right]$$

We continue from (5.2), using the fact that the observed gain $\tilde{x}_j^t$ is between the actual gain $x_j^t$ and

Chapter 6

Relating MTS and Experts


A  second  way  of  extending  the  results  of  Section  3.4  is  to  consider  the  converse  question:  How  do 
MTS  algorithms  perform  on  the  Experts  problem?  Besides  the  academic  and  historic  interest  in  such  a 
question,  the  work-function  approach  used  in  metrical  task  systems  —  a  very  different  approach  from  the 
multiplicative  weight-updating  technique  studied  for  Experts  up  to  now  —  may  prove  more  useful  in  some 
learning  situations. 

In this chapter, we first look at a generic theorem translating any $r$-unfair MTS algorithm into an
Experts algorithm. Then we illustrate an analysis of one particular MTS-derived algorithm (Linear on two
points/experts) in the Experts problem. And finally we look at a small empirical comparison of how our
large set of Experts/MTS algorithms performs on real data inspired by process migration.


6.1  General  relation 

As  Section  3.4  illustrates,  achieving  an  unfair  competitive  ratio  for  the  uniform  MTS  problem  is  similar  to 
achieving a partitioning bound in the Experts setting. The parameter $r$ allows us to trade off the $L_P$ and $k_P$
coefficients, similarly to $\beta$ in Thresh and Share.

Conversion  from  MTS  to  Experts 

The  following  theorem  makes  the  relationship  formal. 

Theorem 6.1 Let $A$ be a randomized algorithm for the MTS problem on the $n$-point uniform space
that, given $r$, achieves an $r$-unfair competitive ratio of $\rho_{n,r}$. Then there is an algorithm $A'$ for
the Experts setting that has expected loss at most

$$\frac{\rho_{n,r}}{r}\,L_P + \rho_{n,r}\,k_P + b$$

for any partition $P$, for some constant $b$ that may depend on $r$ and $n$ (typically, $b \le r$).

Remark. Note that if $\rho_{n,r} = r + \log n$ and $\epsilon = \frac{1}{r}\log n$, then this partitioning bound translates to
$(1+\epsilon)L_P + \left(1+\frac{1}{\epsilon}\right)k_P\log n$, analogous to the bound that Thresh and Share achieve (Theorems 3.3
and 3.4).

Proof. At each time step, our algorithm $A'$ uses whatever distribution $A$ currently has. When it
receives loss vector $\ell^t$, it gives the scaled version $\frac{1}{r}\ell^t$ to $A$ so that $A$ can modify its distribution for $A'$
to use in the next time step.

Consider any sequence of loss vectors $\ell$ and any partition $P$. Let $p^t$ represent the probability
vector on states that $A$ uses for the $t$th time step (and which $A'$ uses for the $(t+1)$st time step).
So, given a loss vector $\ell^t$, $A'$ has expected loss $p^{t-1}\cdot\ell^t$. But $A$ “believes” it is paying $d(p^{t-1}, p^t)$
for movement and $p^t\cdot\left(\frac{1}{r}\ell^t\right)$ for processing. (Because we use an $r$-unfair ratio, in another sense $A$
believes it pays $p^t\cdot\ell^t$ for processing while its adversary pays only $p^t\cdot\left(\frac{1}{r}\ell^t\right)$.)

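We will show that the expected loss to $A'$ is at most

$$E[\mathrm{loss}_{A'}] \le \sum_t \left(d(p^{t-1}, p^t) + p^t\cdot\ell^t\right)
  = E\left[\mathrm{move}(A) + r\,\mathrm{local}\left(\tfrac{1}{r}\ell, A\right)\right]. \qquad (6.1)$$

Once we have this, we can let $v$ be the action sequence corresponding to partition $P$. This sequence
remains at a single expert within each interval of $P$, so that $\mathrm{move}(v) \le k_P$ and $\mathrm{local}\left(\frac{1}{r}\ell, v\right) = \frac{1}{r}L_P$.
Continuing from (6.1), because $A$ has $r$-unfair ratio $\rho_{n,r}$, the expected loss is at most

$$\rho_{n,r}\left(\mathrm{move}(v) + \mathrm{local}\left(\tfrac{1}{r}\ell, v\right)\right) + b \le \rho_{n,r}\left(k_P + \tfrac{1}{r}L_P\right) + b,$$

as the theorem states.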

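To show (6.1), consider a specific trial $\ell^t$. The expected loss to $A'$ is $p^{t-1}\cdot\ell^t$. We bound this by
$d(p^{t-1}, p^t) + p^t\cdot\ell^t$, and (6.1) follows.

$$\begin{aligned}
\sum_i p_i^{t-1}\ell_i^t
  &= \sum_i \left(p_i^{t-1} - p_i^t\right)\ell_i^t + \sum_i p_i^t\ell_i^t \\
  &\le \sum_{i:\,p_i^{t-1} > p_i^t} \left(p_i^{t-1} - p_i^t\right)\ell_i^t + \sum_i p_i^t\ell_i^t \\
  &\le \sum_{i:\,p_i^{t-1} > p_i^t} \left(p_i^{t-1} - p_i^t\right) + \sum_i p_i^t\ell_i^t \\
  &= d(p^{t-1}, p^t) + p^t\cdot\ell^t
\end{aligned}$$

The next-to-last step follows because loss vectors are bounded by 1.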
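
The construction in this proof can be sketched in Python. The interface used below (`distribution()` and `process(task)`) is a hypothetical one chosen for illustration, and `LinearTwoPoint` is only a stand-in MTS algorithm on two points at distance 1 (the Linear rule of Section 6.2 rescaled to unit distance); neither is taken from the thesis.

```python
class LinearTwoPoint:
    """Stand-in MTS algorithm: two states at distance 1 (illustrative only)."""
    def __init__(self):
        self.opt = [0.0, 0.0]            # work-function values

    def distribution(self):
        p1 = 0.5 + (self.opt[1] - self.opt[0]) / 2.0
        return [p1, 1.0 - p1]

    def process(self, task):
        a = min(self.opt[0] + task[0], self.opt[1] + task[1] + 1.0)
        b = min(self.opt[1] + task[1], self.opt[0] + task[0] + 1.0)
        self.opt = [a, b]

class ExpertsFromMTS:
    """A' from the proof: play A's current distribution, then feed A loss / r."""
    def __init__(self, mts, r):
        self.mts, self.r = mts, r

    def observe(self, loss):
        p = self.mts.distribution()                  # p^{t-1}
        expected = sum(pi * li for pi, li in zip(p, loss))
        self.mts.process([li / self.r for li in loss])
        return expected                              # p^{t-1} . loss^t
```

For example, wrapping the stand-in with `r = 2.0` and feeding the loss vector `(1.0, 0.0)` three times yields expected losses 0.5, 0.25, and 0.0 as probability drains away from the first expert.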


Corollaries  to  our  conversion 

This theorem immediately results in new Experts algorithms with approaches very different from
established multiplicative-update algorithms like Thresh and Share. The first comes from applying Theorem 6.1
to our unfair analysis of Marking (Theorem 2.5).

Corollary 6.2 For the Experts problem, Marking has a partitioning bound of at most

$$(1+\epsilon)H_n L_P + \left(1+\frac{1}{\epsilon}\right)H_n k_P + H_n,$$

where $\epsilon = \frac{1}{r}$.

Because the $L_P$ coefficient here approaches $H_n$, this bound is much worse than the bound provided by the
multiplicative-update algorithms (where the $L_P$ coefficient approaches 1).

But  if  we  instead  use  our  r-unfair  analysis  of  Odd-Exponent  (Theorem  4.5),  we  get  a  bound  comparable 
to  that  of  Thresh  and  Share. 

Corollary 6.3 For the Experts problem, if we choose $t = \ln n$, then Odd-Exponent has a
partitioning bound of at most

$$(1+\epsilon)L_P + \left(1+\frac{1}{\epsilon}\right)2e\ln n\,k_P + \frac{2e}{\epsilon} + 2,$$

where $\epsilon = \frac{2e}{r}\ln n$.

This is very comparable to the Share bound; the difference is that the $k_P$ coefficient is about $2e$ times what
Share achieves.

At least some of this $2e$ factor is likely an artifact of our analysis. Based on the $t = 1$ case (Theorem 4.2),
we might suppose that the $2tn^{1/t}$ term of Theorem 4.5 is twice the possible guarantee. But also, using
Theorem 6.1 to convert the MTS unfair competitive ratio to an Experts partitioning bound can involve
some loss. This is illustrated by our direct analysis of Linear on two experts.


6.2  Direct  analysis  of  Linear 

Of  course,  we  can  analyze  an  algorithm  directly  in  the  Experts  environment  rather  than  use  Theorem  6.1. 
We  illustrate  this  with  the  Linear  algorithm  on  two  experts. 

To review: The Linear algorithm on two points maintains the work function values $\mathrm{OPT}_1$ and $\mathrm{OPT}_2$ for the
two points and allocates probability

$$p_1 = \frac{1}{2} + \frac{\mathrm{OPT}_2 - \mathrm{OPT}_1}{2r}$$

to the first point and the remainder to the second. That is, Linear moves probability linearly between experts,
so that an expert's probability is zero when it is pinned. This strategy is optimal for the two-point unfair
MTS problem, achieving a ratio of $r + 1$ (Theorem 4.2).
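
A minimal Python sketch of Linear run as a two-expert algorithm is given below. It assumes the work function is updated with movement cost $r$, and that the loss in each trial is charged against the distribution held before the trial (as in the proof of Theorem 6.1). The function name and return values are illustrative.

```python
def linear_two_experts(losses, r):
    """Run Linear on a list of (loss1, loss2) pairs; return expected loss and p1 history."""
    opt1, opt2 = 0.0, 0.0          # work-function values for the two experts
    p1 = 0.5
    expected_loss, history = 0.0, []
    for l1, l2 in losses:
        # Pay with the distribution chosen before this trial's losses are seen.
        expected_loss += p1 * l1 + (1.0 - p1) * l2
        # Work-function update: stay, or arrive from the other point at cost r.
        new1 = min(opt1 + l1, opt2 + l2 + r)
        new2 = min(opt2 + l2, opt1 + l1 + r)
        opt1, opt2 = new1, new2
        # Since |opt2 - opt1| <= r, p1 always stays within [0, 1].
        p1 = 0.5 + (opt2 - opt1) / (2.0 * r)
        history.append(p1)
    return expected_loss, history
```

With `r = 2` and six trials of loss `(1, 0)`, expert 1 becomes pinned after two trials: the probability on it falls from 0.5 to 0.25 to 0, and the total expected loss is 0.75.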

Before  we  analyze  Linear  in  the  Experts  problem,  notice  that  if  we  use  Theorem  6.1  on  the  r-unfair 
analysis  in  Theorem  4.2,  we  get  the  following. 

Corollary  6.4  For  the  Experts  problem  with  two  experts,  Odd-Exponent  has  a  partitioning  bound 
of  at  most 

(1  +  e)LP  +  ^1  +  kP  +  —  4-  -  , 

where  e  =  -. 

r 

We now analyze Linear directly; this analysis effectively halves the $k_P$ coefficient.

Theorem 6.5 For the Experts problem, the partitioning bound of Linear is at most

$$(1+\epsilon)L_P + \left(1+\frac{1}{\epsilon}\right)\frac{1}{2}k_P,$$

where $\epsilon = \frac{1}{2r}$, provided $r$ is an integer.

Proof. Consider segment $i$ of the partition with loss $L^i$. Assume without loss of generality that the
better expert for the segment is expert 1. (So $L^i$ represents the total loss to expert 1 in the segment.)
Let $\delta$ represent the fractional component of $\mathrm{OPT}_2 - \mathrm{OPT}_1$ (that is, $\delta = (\mathrm{OPT}_2 - \mathrm{OPT}_1) -
\lfloor \mathrm{OPT}_2 - \mathrm{OPT}_1 \rfloor$). (If we can assume the losses are always either 0 or 1, then the proof can be
simplified by ignoring $\delta$ (it is always 0) and ignoring cases 2 and 4 below (which occur only when
$p_1$ or $p_2$ is 0).)

We will use a potential function over this segment of

$$\Phi = r p_2^2 + \frac{1}{2}p_2 + \frac{\delta(1-\delta)}{4r}.$$

Notice that $\Phi$ is always between 0 and $r + \frac{1}{2}$. (If $\mathrm{OPT}_2 - \mathrm{OPT}_1 = -r + \delta$ for $\delta \in [0,1]$, then
$p_2 = 1 - \frac{\delta}{2r}$, and so $\Phi = r + \frac{1}{2} - \delta$.)

Say the algorithm receives loss vector $(\ell_1, \ell_2)$. Our goal is to show that the algorithm's cost plus
potential change is at most $\ell_1\left(1 + \frac{1}{2r}\right)$. If we know this, then the total cost for segment $i$ is at most
$\left(1 + \frac{1}{2r}\right)L^i$ plus the maximum potential change between segments, $r + \frac{1}{2}$. Thus the total cost for the
partition is at most

$$\left(1+\frac{1}{2r}\right)L_P + \left(r+\frac{1}{2}\right)k_P.$$



We can assume that $(\ell_1, \ell_2)$ is 0 in one of its components for the following reason. Let $\ell =
\min\{\ell_1, \ell_2\}$ and divide the vector into two pieces $(\ell, \ell)$ and $(\ell_1 - \ell, \ell_2 - \ell)$. On the first piece the
algorithm's cost is $\ell$ with no effect on probability or potential; and on the second the cost is (as we
will show) at most $(\ell_1 - \ell)\left(1 + \frac{1}{2r}\right)$. So for both pieces the total cost plus potential change is at most
$\ell + (\ell_1 - \ell)\left(1 + \frac{1}{2r}\right) \le \ell_1\left(1 + \frac{1}{2r}\right)$. We split the remaining possibilities into four cases.

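Case 1: The vector is $(\ell, 0)$ and $\mathrm{OPT}_2 - \mathrm{OPT}_1 \ge -r + \ell$. Then $\mathrm{OPT}_2 - \mathrm{OPT}_1$ decreases by
$\ell$ and so $p_1$ loses $\ell/2r$ probability to $p_2$. Notice that the last term of the potential function increases
most when $\delta$ is initially 0. The amortized cost, then, is

$$p_1\ell + \Delta\Phi \le p_1\ell + \left(p_2\ell + \frac{\ell^2}{4r} + \frac{\ell}{4r} + \frac{\ell(1-\ell)}{4r}\right) = \ell\left(1+\frac{1}{2r}\right).$$

Case 2: The vector is $(\ell, 0)$ and for some $\zeta \in [0, \ell)$ we have $\mathrm{OPT}_2 - \mathrm{OPT}_1 = -r + \zeta$. Then $p_2$
increases from $1 - \zeta/2r$ to 1, and $\delta$ drops from $\zeta$ to 0. The amortized cost is

$$p_1\ell + \Delta\Phi = \frac{\zeta}{2r}\ell + \left(\zeta - \frac{\zeta^2}{4r} + \frac{\zeta}{4r} - \frac{\zeta(1-\zeta)}{4r}\right)
  = \frac{\zeta}{2r}\ell + \zeta \le \ell\left(1+\frac{1}{2r}\right).$$
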


Case 3: The vector is $(0, \ell)$ and $\mathrm{OPT}_2 - \mathrm{OPT}_1 \le r - \ell$. Then $p_2$ loses $\ell/2r$ probability to $p_1$.
The last term of the potential function increases by at most $\ell(1-\ell)/4r$. The amortized cost is

$$p_2\ell + \Delta\Phi \le p_2\ell + \left(-p_2\ell + \frac{\ell^2}{4r} - \frac{\ell}{4r} + \frac{\ell(1-\ell)}{4r}\right) = 0.$$

Case 4: The vector is $(0, \ell)$ and for some $\zeta \in [0, \ell)$ we have $\mathrm{OPT}_2 - \mathrm{OPT}_1 = r - \zeta$. Then $p_2$
drops from $\zeta/2r$ to 0, and, because $r$ is integral, $\delta$ drops from $1 - \zeta$ to 0. The amortized cost is

$$p_2\ell + \Delta\Phi = \frac{\zeta}{2r}\ell + \left(-\frac{\zeta^2}{4r} - \frac{\zeta}{4r} - \frac{\zeta(1-\zeta)}{4r}\right) \le 0.$$

In all cases, the algorithm's amortized cost is at most $\ell_1\left(1 + \frac{1}{2r}\right)$. ■

6.3 Process migration experiments

We  now  examine  some  brief  experimental  results  comparing  several  algorithms,  including  many  Experts/MTS 
algorithms,  on  data  representing  a  process  migration  problem.  Process  migration  has  aspects  of  both  the 
MTS  problem  and  the  Experts  settings.  There  is  a  cost  to  move  between  machines,  but  there  is  also  zero 
lookahead. 

For process migration data, we collected load averages from 112 machines around the CMU
campus. We queried each machine every five minutes for 6.5 days. From these machines, we selected 32
that were busy enough to be interesting for this analysis.

Each five-minute interval corresponds to a trial with loss vector $\ell^t$. For machine $i$, we set $\ell_i^t = 1$ if the
machine had a large load average (more than 0.5), and $\ell_i^t = 0$ if it had a small load average. The intent of this
is to model the decision faced by a “user-friendly” background process that suspends its work if someone
else is using the same machine.
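
The thresholding rule is simple enough to state in code. The sketch below assumes the raw data is a list of per-interval samples, each a list of load averages (one per machine); that data layout and the function name are illustrative assumptions. It also checks the trial count implied by the sampling schedule: 6.5 days at one sample every five minutes gives 1872 trials, which matches two sets of 936 trials.

```python
def loss_vectors(load_samples, threshold=0.5):
    """Map each interval's load averages to a 0/1 loss vector (1 = machine busy)."""
    return [[1 if load > threshold else 0 for load in sample]
            for sample in load_samples]

# Number of five-minute trials in 6.5 days.
TRIALS = int(6.5 * 24 * 60 / 5)
```

For instance, `loss_vectors([[0.1, 0.7], [0.5, 2.0]])` gives `[[0, 1], [0, 1]]`, since a load of exactly 0.5 is not "more than 0.5".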

We took the distance between the machines to be 0.1, indicating that 30 seconds of computation would
be lost for movement between machines. In research process migration systems, the time for a process to
move  is  roughly  proportional  to  its  size.  For  a  100-KB  process,  the  time  is  about  a  second  [Esk90].  Our 
distance  corresponds  to  large  but  reasonable  memory  usage. 

Our  simulations  compared  the  performance  of  nine  algorithms,  including  four  simple  control  algorithms: 

Uniform  The  algorithm  picks  a  random  machine  and  stays  there  for  all  trials. 

Greedy  After  each  trial  the  algorithm  moves  to  the  machine  that  incurred  the  least  loss  in  that  trial  (with 
ties  broken  randomly). 

Least-Used  After  each  trial  the  algorithm  moves  to  the  machine  that  has  incurred  the  least  total  loss  so 
far. 

Recent  The  algorithm  moves  to  the  machine  that  has  incurred  the  least  loss  over  the  last  k  trials. 
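
The three deterministic control rules can be sketched as follows. The cost model (pay the current machine's loss in each trial, plus the distance `d` for every move) follows the description above; the tie-breaking rule here (stay put if the current machine is among the best, otherwise take the lowest index) is an assumption made to keep the sketch deterministic, whereas the Greedy described above breaks ties randomly.

```python
def simulate(losses, d, choose_next, start=0):
    """Return (total cost, number of moves) for a deterministic move rule."""
    current, cost, moves = start, 0.0, 0
    for t, loss in enumerate(losses):
        cost += loss[current]                       # pay the loss where we sit
        nxt = choose_next(losses[: t + 1], current)
        if nxt != current:                          # pay d for each move
            cost, moves, current = cost + d, moves + 1, nxt
    return cost, moves

def _argmin(scores, current):
    best = min(scores)
    return current if scores[current] == best else scores.index(best)

def greedy(history, current):
    """Move to the machine with the least loss in the latest trial."""
    return _argmin(history[-1], current)

def least_used(history, current):
    """Move to the machine with the least total loss so far."""
    return _argmin([sum(col) for col in zip(*history)], current)

def recent(k):
    """Move to the machine with the least loss over the last k trials."""
    def rule(history, current):
        return _argmin([sum(col) for col in zip(*history[-k:])], current)
    return rule
```

On the toy sequence `[[1, 0], [1, 0], [0, 1], [0, 1]]` with `d = 0.1`, Greedy moves twice for a cost of 2.2, while Least-Used moves once and pays 3.1.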

We implemented Work-Function, Marking, Odd-Exponent (with $t = 3$), Thresh, and Share. (Efficiently
implementing Odd-Exponent to compensate for Assumption 4.1 is a challenge; we discuss this at the end
of this section.)

Because  these  algorithms  have  tunable  parameters,  we  divided  the  data  into  a  training  set  and  a  test 
set,  936  trials  each.  We  optimized  parameters  on  the  training  set  and  report  the  performance  with  these 
parameters  on  the  test  set.  We  also  present  the  performance  of  each  algorithm  with  a  “naive”  parameter 
setting,  to  give  a  sense  of  the  dependence  of  the  behavior  of  the  algorithm  on  the  tuning  of  its  parameters. 

For each algorithm we determined the expected loss for the probability vectors it calculated. One valid
criticism of using probabilistic algorithms in practice is the variance between runs; so we also calculated the
standard deviation over 200 runs of each algorithm. To get a feel for how each algorithm behaves, we finally
computed the expected number of moves.

This  data  is  summarized  in  Table  6.1  where  costs  are  given  relative  to  the  optimal  off-line  sequence, 
which  suffered  a  loss  of  3.8  and  moved  8  times  in  the  test  sequence. 

We  also  tried  an  inter-machine  distance  of  1.0.  Table  6.2  summarizes  these  results.  For  an  inter-machine 
distance  of  1.0,  the  optimal  off-line  sequence  suffered  a  loss  of  11  and  moved  6  times  during  the  936  trials. 
(As  one  would  expect,  the  loss  is  higher  but  there  are  fewer  movements.) 

Comparing these algorithms to the simpler control algorithms indicates that their added sophistication
does indeed help. The numbers seem to indicate that the MTS-based algorithms are less sensitive to
parameter settings. The specific experiments summarized here show the MTS algorithms performing
somewhat better; if the parameters are set based on the test data, this difference decreases.

algorithm       tuned parameter setting                            cost ratio   std dev   expected moves   naive parameter setting         naive cost ratio
Uniform         -                                                      206.69     29.03             0.00   -                                              -
Greedy          -                                                       55.11      4.33           265.34   -                                              -
Least-Used      -                                                      117.71      0.00             5.00   -                                              -
Recent          $k = 6$                                                 17.92      0.00           103.00   $k = 5$                                    24.37
Work-Function   $r = 1.0$                                                5.66      0.00            17.00   $r = 1.0$                                   5.66
Marking         $r = 1.0$                                                5.97      0.72            20.54   $r = 1.0$                                   5.97
Odd-Exponent    $t = 3$, $r = 10.0$                                      5.96      0.79            15.84   $t = 3$, $r = 1.0$                          6.05
Thresh          $\beta = 9.5 \times 10^{-6}$, $\alpha = 10^{-4}$         7.16      0.66            14.53   $\beta = 0.5$, $\alpha = 0.01$             20.89
Share           $\beta = 5.2 \times 10^{-7}$, $\alpha = 10^{-8}$         6.55      0.63            14.58   $\beta = 0.5$, $\alpha = 0.01$             19.44

Table 6.1: Performance relative to optimal off-line sequence ($d = 0.1$) on process migration data.


algorithm       tuned parameter setting                  cost ratio   std dev   expected moves   naive parameter setting         naive cost ratio
Uniform         -                                             71.40     10.90             0.00   -                                              -
Greedy          -                                             40.75      2.91           265.34   -                                              -
Least-Used      -                                             41.07      0.00             5.00   -                                              -
Recent          $k = 11$                                       6.62      0.00            41.00   $k = 5$                                    19.71
Work-Function   $r = 1.0$                                      3.34      0.00            13.00   $r = 1.0$                                   3.34
Marking         $r = 0.4$                                      3.74      0.40            20.54   $r = 1.0$                                   4.27
Odd-Exponent    $t = 3$, $r = 1.0$                             3.36      0.51            15.84   $t = 3$, $r = 1.0$                          3.36
Thresh          $\beta = 0.027$, $\alpha = 10^{-8}$            5.52      0.34            10.66   $\beta = 0.5$, $\alpha = 0.01$              8.20
Share           $\beta = 0.014$, $\alpha = 10^{-8}$            5.59      0.39            11.56   $\beta = 0.5$, $\alpha = 0.01$              7.68

Table 6.2: Performance relative to optimal off-line sequence ($d = 1.0$) on process migration data.


algorithm             competitive ratio                      partitioning bound
Two-Region ($n = 2$)  $r + 1$ (Th 4.2)                       $(1+\epsilon)L_P + \left(1+\frac{1}{\epsilon}\right)\frac{1}{2}k_P$ (Th 6.5)
Marking               $(r+1)H_n$ (Th 2.5)                    $(1+\epsilon)H_n L_P + \left(1+\frac{1}{\epsilon}\right)H_n k_P$ (Cor 6.2)
Odd-Exponent          $r + 2e\ln n$ (Th 4.5)                 $(1+\epsilon)L_P + \left(1+\frac{1}{\epsilon}\right)2e\ln n\,k_P$ (Cor 6.3)
Thresh                unbounded                              (Th 3.3)
Share                 $r + 6.4\ln(n(r+1)) + 4$ (Th 3.5)      (Th 3.4)

Table 6.3: Summary of theoretical results.

The numbers indicate that Work-Function slightly outperforms the randomized algorithms, despite its
worse  theoretical  guarantee.  This  is  not  too  surprising  because  a  randomized  algorithm  is  essentially  using 
its  probability  distribution  to  hedge  its  bets,  placing  probability  on  states  that  do  not  necessarily  appear 
optimal.  This  is  somewhat  analogous  to  a  stock  market,  in  which  the  main  reason  to  diversify  is  to  minimize 
the  downside  risk  more  than  to  maximize  expected  gain.  In  these  experiments,  all  the  algorithms  performed 
better  than  their  worst-case  guarantees.  In  practice,  Odd-Exponent  follows  Work-Function  very  closely, 
although  it  smooths  the  transitions  between  states. 

Implementing  Odd-Exponent 

In an implementation of Odd-Exponent, using OPT values strictly as defined introduces a problem: The
algorithm could allocate negative probability to an expert. (Consider the case where expert 1 has $\mathrm{OPT}_1 = r$
while the rest are at zero.) The analysis of Theorem 4.8 skirts the issue by assuming Assumption 4.1.

If we wish to implement Odd-Exponent, we must confront the possibility that tasks observed will not
obey this condition. We can address this by using a modification of the work function, $\widetilde{\mathrm{OPT}}$, in computing
the probability distribution of the strategy. This $\widetilde{\mathrm{OPT}}$ is computed as follows. Say the strategy receives a
loss vector $\ell$. We will change $\widetilde{\mathrm{OPT}}_i$ to become, not $\min\{\mathrm{OPT}_i + \ell_i, \min_j \mathrm{OPT}_j + \ell_j + r\}$ as for the
work function, but $\min\{\widetilde{\mathrm{OPT}}_i + \ell_i, x\}$, where $x$ is the greatest value such that no probabilities are negative.
(In an implementation one can compute $x$ by considering the function returning the minimum probability
for a given $x$ and using numerical techniques to find where this function reaches zero.) This avoids negative
probabilities because each probability that would have become negative with the unmodified work function
becomes zero instead.
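
One numerical technique that fits this description is bisection. The sketch below is generic: `min_probability` is a hypothetical caller-supplied function returning the smallest probability the strategy would assign if the work-function values were capped at `x`, and it is assumed to be nonincreasing in `x` with a nonnegative value at `lo`. The Odd-Exponent probability formula itself is not reproduced here.

```python
def largest_cap(min_probability, lo, hi, tol=1e-9):
    """Largest x in [lo, hi] with min_probability(x) >= 0, found by bisection.

    Assumes min_probability is nonincreasing and min_probability(lo) >= 0.
    """
    if min_probability(hi) >= 0:
        return hi                      # no capping is needed
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if min_probability(mid) >= 0:
            lo = mid                   # still feasible: move the cap up
        else:
            hi = mid                   # infeasible: move the cap down
    return lo
```

As a stand-in for illustration, take the two-point Linear rule with $r = 2$, $\mathrm{OPT}_2 = 1$, and an uncapped candidate value of 4 for expert 1: the minimum probability is $\frac{1}{2} + (1 - \min(4, x))/4$, which reaches zero at $x = 3 = \mathrm{OPT}_2 + r$, and the bisection recovers that value.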

This modification maintains the same competitive ratio because we can think of it as dividing each cost
vector into two pieces, $\tilde{\ell}$ and $\ell - \tilde{\ell}$, where $\tilde{\ell}$ is the change in $\widetilde{\mathrm{OPT}}$ produced by the update. For $\tilde{\ell}$, the algorithm is competitive with
respect to the off-line player's cost on $\tilde{\ell}$ (which itself is less than the off-line player's cost on $\ell$). For $\ell - \tilde{\ell}$,
the algorithm will pay nothing, since the vector is nonzero only at states where $\widetilde{\mathrm{OPT}} = x$, and these states
have no probability.

Chapter 7

The unfair paging problem


One of the strands running beneath this thesis is the usefulness of the notion of unfairness in on-line
analysis. This is most apparent in our development of a polylog($n$) MTS algorithm, but the machine-learning
notion of a partitioning bound (in the related but different Experts problem) is also actually a question of
unfairness. What unfairness allows us to do is to build more sophisticated bounds than a straight
competitive ratio allows, essentially by parameterizing the relative importance of different costs. This prevents an
algorithm from ignoring one part of the costs. For example, standard algorithms for the MTS problem can
be sloppy with local costs as long as they are only a constant factor more than the movement cost. Adding
unfairness to the model forces us to be careful with both aspects.

One  can  naturally  ask  if  this  advantage  can  be  extended  to  other  problems.  In  this  chapter,  we  see  that  it 
can,  in  particular  to  the  Paging  problem. 

Problem  Paging  An  on-line  algorithm  controls  a  cache  of  k  pages  and  sees  a  sequence  of  memory 
requests $p^1, p^2, \ldots$. When an item outside the current cache is requested, the algorithm incurs a
page  fault  and  must  load  the  requested  page  into  the  cache,  evicting  some  other  page  of  its  choice. 

The  goal  of  the  algorithm  is  to  minimize  the  number  of  page  faults. 

Fiat et al. describe Marking, a randomized algorithm for Paging (similar to the eponymous MTS
algorithm by Borodin, Linial, and Saks), with a competitive ratio of $O(\log k)$ [FKL+91, BLS92]. (Fiat et al.
also show that every Paging algorithm must have a competitive ratio of at least $\Omega(\log k)$.)

Algorithm  Marking  ([FKL+91])  For  each  of  the  k  cache  locations,  we  have  space  for  a  mark, 
initially  empty.  When  a  page  in  the  cache  is  requested,  we  mark  its  location.  When  a  page  outside 
the  cache  is  requested,  we  pick  a  random  unmarked  location,  eject  its  page,  and  mark  the  location. 

If  all  locations  are  marked,  we  clear  the  marks  and  begin  a  new  phase. 
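
A short Python sketch of Marking follows. It tracks marks per cached page rather than per location (equivalent, since a marked page is never evicted within a phase) and, as an assumption not stated above, starts from an empty cache that it fills before evicting anything.

```python
import random

def marking_faults(requests, k, seed=0):
    """Count page faults of the Marking algorithm with a cache of k pages."""
    rng = random.Random(seed)
    cache = []          # cache[i] is the page held in location i
    marked = set()      # pages whose locations are currently marked
    faults = 0
    for page in requests:
        if page in cache:
            marked.add(page)                 # hit: mark the page's location
            continue
        faults += 1
        if len(cache) < k:
            cache.append(page)               # assumption: fill empty slots first
        else:
            if len(marked) == k:             # all marked: start a new phase
                marked.clear()
            unmarked = [i for i, q in enumerate(cache) if q not in marked]
            cache[rng.choice(unmarked)] = page   # evict a random unmarked page
        marked.add(page)
    return faults
```

For example, a request sequence that only ever touches `k` distinct pages incurs exactly `k` faults, all of them while the cache is filling.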

Theorem 7.1 ([FKL+91]) Marking has a competitive ratio of $2H_k$ for Paging.

How to incorporate unfairness into Paging is not obvious. Our approach is the following: Suppose that
on a page fault, the off-line algorithm is allowed the additional power to “rent” the requested page at a cost
of only $1/r$ (think of $r = \log k$), compared with the cost of 1 for actually loading the page into the cache.
Renting means that the memory request is serviced but the requested page is not brought into the cache and
the off-line cache is not modified. So, for instance, if the off-line algorithm rents a page and then the same
page is requested again, the off-line algorithm incurs another page fault. The on-line algorithm has no such
privilege. (Technically, it is convenient to allow the on-line algorithm to rent for a cost of 1; at best, this
helps the on-line algorithm by a factor of two.) The question we examine is, what competitive ratio can be
achieved in this scenario? This question can be thought of as the unfair version of Paging, because we have
split the cost into renting and loading, with the off-line algorithm having an unfair advantage on renting.

For this harder unfair problem, no algorithm can achieve an $r$-unfair competitive ratio less than $r$
(consider a sequence where each request is to a new page), nor can any algorithm achieve a competitive ratio
less than $\Omega(\log k)$. Marking achieves competitive ratio $O(r \log k)$. We consider the question of whether
one can achieve ratio $O(r + \log k)$. The main result of this chapter is that we can, using Hedge together with
a notion of phases similar to Marking.


7.1  Motivation 

Because the problem stated above is not obviously self-motivating, we begin by presenting two motivations,
one from paging and another from the $k$-server problem.

Finely-competitive paging Request sequences in practice often consist of a core working set of frequently
requested pages, together with occasional assorted memory requests, where this working set slowly changes
over time. Suppose that, in hindsight, the request sequence can be partitioned into time periods containing
working sets $W^1, W^2, \ldots, W^m$ respectively, where within each time period the number of requests to
pages outside the current working set is $o^1, o^2, \ldots, o^m$. Furthermore, suppose that each working set is
small enough to fit within the memory cache ($|W^i| \le k$). In this scenario, one off-line strategy in our
“unfair” model is to load the current working set into the cache and to rent the requests outside the current
working set, at a cost of

$$\frac{1}{r}\left(o^1 + \cdots + o^m\right) + |W^1| + |W^2 \setminus W^1| + \cdots + |W^m \setminus W^{m-1}|.$$

Taking $r = \log k$, an algorithm with unfair competitive ratio $O(r + \log k)$ must pay at most $O(\log k)$ times
this, or

$$O\left(\left(o^1 + \cdots + o^m\right) + (\log k)\left(|W^1| + |W^2 \setminus W^1| + \cdots + |W^m \setminus W^{m-1}|\right)\right).$$

So, if the sequence involves only a few working sets or if their differences are small compared to the $o^i$, the
on-line algorithm is only a small (constant) factor from the optimal service sequence.
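
The cost of this off-line strategy is easy to compute directly. In the sketch below the working sets are Python sets and `outside_requests[i]` plays the role of $o^i$; the function name and argument layout are illustrative assumptions.

```python
def offline_working_set_cost(working_sets, outside_requests, r):
    """Cost of loading each working set and renting every outside request at 1/r."""
    loads = len(working_sets[0]) + sum(
        len(cur - prev) for prev, cur in zip(working_sets, working_sets[1:])
    )
    rentals = sum(outside_requests) / r
    return rentals + loads
```

For example, with working sets `{1, 2, 3}` and `{2, 3, 4}`, outside-request counts `[5, 3]`, and `r = 4`, the strategy pays 4 for loads (three pages, then one new page) and 2 for rentals, for a total of 6.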

Here is a simple concrete example. Suppose that the request sequence repeatedly cycles over a fixed
set of $k + 1$ pages. In that case, the deterministic LRU algorithm has competitive ratio $k$ (it faults on
every request) and Marking has competitive ratio $O(\log k)$ (in expectation, it makes $O(\log k)$ page faults
per cycle). However, our algorithm in this case is required to have an $O(1)$ ratio because we can view
this sequence as having a single fixed working set of size $k$, with one additional request per cycle. In other
words, in the unfair model, the off-line algorithm could simply incur a cost of $1/r$ per cycle by renting.
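
The claim that LRU faults on every request of such a cycle can be checked with a small simulation. This is a plain LRU sketch starting from an empty cache (an assumption), written with the standard library's `OrderedDict`.

```python
from collections import OrderedDict

def lru_faults(requests, k):
    """Count page faults of LRU with a cache of k pages, starting empty."""
    cache = OrderedDict()            # keys ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # refresh recency on a hit
        else:
            faults += 1
            if len(cache) == k:
                cache.popitem(last=False)   # evict the least recently used page
            cache[page] = True
    return faults
```

Cycling over `k + 1 = 4` pages with `k = 3` for five cycles gives 20 faults out of 20 requests, whereas the renting off-line strategy pays only `1/r` per cycle after loading its `k` pages.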

In  a  sense,  this  goal  can  be  viewed  as  follows.  The  motivation  of  the  competitive  ratio  measure  itself 
is  to  allow  the  on-line  algorithm  to  perform  worse  on  “harder”  sequences  but  to  require  it  to  perform  better 
on  “easier”  ones.  Unfairness  provides  a  more  fine-grained  measure,  in  which  we  split  the  off-line  cost  into 
an  “easy”  component  (the  rentals)  and  a  “hard”  component  (the  loads).  We  require  the  algorithm  to  be 
constant-competitive with respect to the easy component and allow an $O(\log k)$ ratio only with respect to
the  hard  component. 

Because  of  the  working  set  phenomenon,  researchers  have  tried  designing  cache  systems  that  in  a  certain 
sense  add  such  a  renting  ability.  One  practical  implementation  is  to  reserve  the  main  cache  for  the  supposed 
working  set  while  adding  a  second,  smaller  cache  of  potential  working-set  candidates  [JS97]. 
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The  A-server  problem  The  question  of  the  best  possible  competitive  ratio  for  the  A -Server  problem  of 
Manasse,  McGeoch,  and  Sleator  [MMS90]  remains  a  major  open  question. 

Problem  A  -Server  The  algorithm  is  given  a  metric  space  and  an  initial  selection  of  k  points  where 
it  has  servers.  It  faces  a  sequence  of  requests  to  points  in  the  space.  When  it  receives  a  request,  the 
algorithm  must  choose  a  server  to  move  to  the  requested  point.  The  goal  is  to  minimize  the  total 
distance  traveled  by  the  servers. 

Notice that a k-Server instance on a space of k + 1 points is easily modeled as an MTS problem instance
with k + 1 points. In particular, each of the k + 1 points corresponds to a page that is not in the cache:
the cache holds all other pages but the state's corresponding page.

Koutsoupias and Papadimitriou's proof that the Work-Function algorithm achieves an O(k) competitive
ratio was a breakthrough result, especially given the Ω(k) lower bound for deterministic algorithms [MMS90,
KP95]. It is conceivable, however, that a randomized algorithm could achieve a polylog(k) ratio. Hope that
this might be possible comes from the polylog(n) MTS result in Theorem 4.8. At the core of Theorem 4.8
is an algorithm for achieving an O(r + log n) ratio for the r-unfair MTS problem. Our goal of O(r + log k)
for r-unfair Paging can be thought of as an extension of the O(r + log n) r-unfair MTS bound. This could
potentially be one step toward achieving a polylog(k) bound for k-Server. (Of course, there are many
additional issues involved in attempting to construct such a recursive k-Server algorithm.)


7.2  A  universe  of  k  +  1  pages 


Before  we  look  at  the  general  case  where  there  can  be  arbitrarily  many  pages  requested,  we  first  restrict  our 
attention  to  the  simpler  case  where  the  request  sequence  can  only  include  one  more  page  than  can  be  held 
in  the  cache  (although  any  of  these  pages  can  of  course  be  requested  arbitrarily  many  times,  in  any  order). 
This  restricted  case  illustrates  some  of  the  ideas  that  appear  in  our  general  result. 

Because of the close relationship of the (k + 1)-point case and metrical task systems, our result here
can  be  seen  as  being  an  alternative  to  the  two  good  algorithms  for  the  MTS  problem  we  have  already  seen, 
Share  and  Odd-Exponent.  This  new  algorithm  is  simpler  to  describe  and  to  analyze  than  the  others,  though 
the  constants  are  slightly  worse.  It  is  a  combination  of  Marking  and  Hedge. 

Algorithm Phased-Hedge  Each phase proceeds until every one of the k + 1 pages has had r
requests. At the beginning of the phase, we associate to each page i a weight $w_i$, initialized to 1. The
weights $w_i$ define a probability distribution $p_i = w_i/W$, where $W = \sum_j w_j$; this is our probability
over pages not to have in the cache. (For example, initially all weights are 1 and so each page is
equally likely to be the one outside the cache.) When a page is requested, we multiply the page's
weight by β (a parameter of the algorithm) and readjust our probability distribution accordingly.
(This effectively increases the probability that the page is in the cache.)
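
The weight updates of Phased-Hedge can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the function name `phased_hedge_probs`, the encoding of pages as integers 0..k, and the choice to record the distribution just before a phase reset are all assumptions made for the example.

```python
def phased_hedge_probs(requests, num_pages, r, beta):
    """Return, after each request, the distribution over which page is outside the cache.

    Pages are the integers 0 .. num_pages - 1 (num_pages = k + 1).
    This naming and encoding is hypothetical, chosen only for illustration.
    """
    weights = [1.0] * num_pages   # one weight per page, reset at each phase start
    counts = [0] * num_pages      # requests seen per page in the current phase
    history = []
    for page in requests:
        weights[page] *= beta     # requested page becomes less likely to be left out
        counts[page] += 1
        total = sum(weights)
        history.append([w / total for w in weights])
        if all(c >= r for c in counts):
            # Every page has had r requests: the phase ends and we reinitialize.
            weights = [1.0] * num_pages
            counts = [0] * num_pages
    return history
```

For instance, with three pages, r = 2 and β = 1/2, the request sequence 0, 0, 1 leaves page 2 as the most likely page to be outside the cache, since it is the only page that has not been requested.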

In  the  terminology  of  the  machine  learning  literature,  we  could  think  of  having  an  “expert”  associated 
to  each  of  the  k  +  1  subsets  of  k  pages  advocating  that  the  cache  contain  these  k  pages,  and  we  could  think 
of  Phased-Hedge  as  Hedge  with  the  small  modification  that  we  reinitialize  the  algorithm  periodically  at 
phase  boundaries. 

Theorem 3.2 states that the expected loss incurred by Hedge is at most

$$\frac{\ln(1/\beta)}{1-\beta}\, L + \frac{1}{1-\beta} \ln n$$

where L is the loss of the best expert in hindsight and n is the number of experts. In our context, this implies
that the expected cost of the Phased-Hedge algorithm per phase is at most
$1 + (r \ln(1/\beta) + \ln(k+1))/(1-\beta)$. (The "1+" is the initialization cost for choosing a random page at
the phase's beginning.) Now, noting that the off-line algorithm must pay at least 1 per phase, either to evict
a page or to rent a page r times, we have the following theorem.


Theorem 7.2  The competitive ratio of the Phased-Hedge algorithm for the r-unfair (k + 1)-page
Paging problem is at most

$$\frac{\ln(1/\beta)}{1-\beta}\, r + \frac{1}{1-\beta} \ln(k+1) + 1 .$$


For β = 3/4, the bound of Theorem 7.2 is approximately 1.15r + 4 ln(k + 1) + 1. As β approaches 1, the
bound approaches $(1 + \epsilon/2)\, r + \frac{1}{\epsilon} \ln k + 1$ for ε = 1 − β.
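
The constants quoted above can be checked numerically. The helper below is a small sketch (the name `theorem_7_2_coefficients` is hypothetical) that evaluates the two β-dependent coefficients of the Theorem 7.2 bound.

```python
import math

def theorem_7_2_coefficients(beta):
    """Return (coefficient of r, coefficient of ln(k + 1)) in the Theorem 7.2 bound."""
    r_coeff = math.log(1 / beta) / (1 - beta)
    log_coeff = 1 / (1 - beta)
    return r_coeff, log_coeff
```

Evaluating at β = 3/4 gives coefficients close to 1.15 and exactly 4, and for β = 1 − ε with small ε the first coefficient is close to 1 + ε/2, which illustrates the trade-off: pushing the coefficient of r toward 1 inflates the coefficient of the logarithmic term.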

For  Paging  on  more  than  k  +  1  pages,  we  extend  the  Phased-Hedge  algorithm  to  have  one  “expert” 
for  every  subset  of  pages  marked  in  the  previous  phase,  which  the  expert  predicts  should  be  kept  in  the  cache 
during  the  current  phase.  (A  page  is  marked  in  a  phase  if  it  is  requested  at  least  r  times,  and  a  phase  ends 
when  k  pages  are  marked.)  Ignoring  implementation  issues,  the  two  difficulties  that  this  approach  entails 
are  first,  that  there  are  now  many  more  experts,  and  second,  that  the  possible  cost  for  switching  between  two 
different experts increases from 1 to k. We deal with the first issue by giving a nonuniform initial weighting
to  the  experts.  The  second  issue  involves  substantially  more  effort. 


7.3  The  general  case:  Phases  and  the  off-line  cost 

We  begin  our  analysis  of  the  general  case  by  defining  the  notion  of  “phase”  that  the  on-line  algorithm  uses 
and  proving  a  lower  bound  for  the  off-line  cost  based  on  this  notion.  Then  in  Section  7.4  we  describe  how 
the  algorithm  behaves  within  each  phase  and  prove  an  upper  bound  on  the  expected  on-line  cost.  Because 
our  on-line  algorithm  is  not  a  “lazy”  algorithm,  we  separately  analyze  its  expected  number  of  page  faults 
(the  easier  part  of  the  analysis)  and  its  expected  cost  for  modifying  its  probability  distribution  over  caches 
(the  harder  analysis).  To  define  the  initial  state  of  our  problem,  we  assume  the  cache  is  empty  before  the 
first  request  occurs. 

Like  the  Marking  algorithm,  we  divide  the  request  sequence  into  phases.  We  say  that  page  j  is  marked 
when  it  has  accumulated  at  least  r  requests  within  the  phase.  The  phase  reaches  its  end  when  any  k  pages 
become  marked. 

Let $M^i$ denote the set of pages marked in phase i. (Define $M^0$ to be the empty set.) Also, let $t^i_j$ denote
the number of requests to page j in phase i. We define $m^i$ as the number of pages marked in phase i but not
in the previous phase ($|M^i \setminus M^{i-1}|$). Finally, we define $o^i$ as the total off-line cost for renting pages
outside $M^{i-1} \cup M^i$; that is, $o^i = \frac{1}{r} \sum_{j \notin M^{i-1} \cup M^i} t^i_j$.

As  in  the  standard  analysis  of  Marking,  this  use  of  phases  gives  a  convenient  lower  bound  on  the  off-line 
player’s  cost. 

Lemma 7.3  If $\mathrm{cost}_{\mathrm{opt}}(\sigma)$ is the optimal off-line cost for the task sequence, then we have

$$\mathrm{cost}_{\mathrm{opt}}(\sigma) \ge \frac{1}{2} \sum_i (m^i + o^i) .$$


Proof.  Consider two phases i − 1 and i together. Notice that for all but the k pages in the off-line
cache at the beginning of phase i − 1, the off-line algorithm must either load the page into its cache,
at a cost of at least 1, or service all requests to that page (if any) by renting, at a cost of at least
$(t^{i-1}_j + t^i_j)/r$. Therefore, any off-line algorithm must pay at least

$$\mathrm{cost}_{\mathrm{opt}}(\sigma^{i-1}\sigma^i) \ge \sum_j \min\left\{1, \frac{t^{i-1}_j + t^i_j}{r}\right\} - k$$

in these two phases. For pages j marked in phases i − 1 or i, we know $t^{i-1}_j + t^i_j \ge r$; for other
pages j, we know $t^i_j < r$ (and so $t^i_j/r < 1$) since j is not marked in phase i. These facts imply

$$\sum_j \min\left\{1, \frac{t^{i-1}_j + t^i_j}{r}\right\} - k \ge \sum_{j \in M^{i-1} \cup M^i} 1 + \sum_{j \notin M^{i-1} \cup M^i} \frac{t^i_j}{r} - k = (k + m^i) + o^i - k = m^i + o^i .$$

Also note that any off-line player must pay at least $m^1 + o^1$ in the first phase. Let $\sigma^i$ represent
the sequence of requests in phase i. Then we get the following.

$$\begin{aligned}
2\,\mathrm{cost}_{\mathrm{opt}}(\sigma) &\ge \mathrm{cost}_{\mathrm{opt}}((\sigma^1\sigma^2)(\sigma^3\sigma^4)\cdots) + \mathrm{cost}_{\mathrm{opt}}(\sigma^1(\sigma^2\sigma^3)(\sigma^4\sigma^5)\cdots) \\
&\ge ((m^2 + o^2) + (m^4 + o^4) + \cdots) + ((m^1 + o^1) + (m^3 + o^3) + \cdots) \\
&= \sum_i (m^i + o^i) .
\end{aligned}$$
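
To make the phase bookkeeping concrete, the following sketch splits a request sequence into phases and computes the quantities used in Lemma 7.3. The function names are hypothetical, and treating a trailing incomplete phase as a phase of its own is an assumption of this example rather than something the text specifies.

```python
from collections import Counter

def _summarize(counts, marked, prev_marked, r):
    """Return (M^i, m^i, o^i) for one phase."""
    m_i = len(marked - prev_marked)
    # o^i: requests to pages outside M^{i-1} and M^i, each rented at cost 1/r.
    o_i = sum(c for page, c in counts.items()
              if page not in (marked | prev_marked)) / r
    return set(marked), m_i, o_i

def phase_stats(requests, k, r):
    """Split requests into phases; a phase ends once k pages have r requests each."""
    stats, prev_marked = [], set()            # M^0 is the empty set
    counts, marked = Counter(), set()
    for page in requests:
        counts[page] += 1
        if counts[page] >= r:
            marked.add(page)
        if len(marked) == k:
            stats.append(_summarize(counts, marked, prev_marked, r))
            prev_marked, counts, marked = marked, Counter(), set()
    if counts:                                # assumption: keep a trailing partial phase
        stats.append(_summarize(counts, marked, prev_marked, r))
    return stats

def offline_lower_bound(requests, k, r):
    """Half the sum of m^i + o^i over all phases, as in Lemma 7.3."""
    return 0.5 * sum(m_i + o_i for _, m_i, o_i in phase_stats(requests, k, r))
```

For example, with k = 2 and r = 2, the sequence a a b c b c c a a d splits into two complete phases and one partial phase, and the resulting lower bound is 2.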


7.4  The  on-line  algorithm 

We now describe a randomized on-line algorithm whose expected cost in each phase i is at most O(r + log k)
times the off-line bound of $\frac{1}{2}(m^i + o^i)$ given in Lemma 7.3. To describe the algorithm, we use $p^t_j$ to denote
the probability that page j is in the cache after servicing the t-th request. For ease of analysis, our algorithm
may throw out (invalidate) pages in its cache even when there is no immediate need to do so, so $\sum_j p^t_j$ may
be less than k for some times t.

We  divide  the  description  and  analysis  of  the  algorithm  into  two  parts.  First,  we  describe  how  the 
algorithm determines the probabilities $p^t_j$, and we use this to bound the expected number of page faults
incurred  by  the  algorithm.  We  then  describe  how  the  algorithm  loads  and  ejects  pages  to  maintain  these 
probabilities,  and  we  bound  the  additional  cost  incurred  by  those  operations. 

The  on-line  cache  probabilities  and  expected  number  of  page  faults 

The algorithm determines the probabilities $p^t_j$ based on a weighted average over a collection of "experts".
In phase i, we define an expert for each subset $A \subseteq M^{i-1}$ and give it an initial weight of $1/k^{k-|A|}$. The
pages in the cache for this "expert" are the pages in the set A, plus up to the first k − |A| pages not in $M^{i-1}$
marked so far. Equivalently, we can think of the expert representing the following deterministic Paging
algorithm:

•  Initially, eject all pages in the set $M^{i-1} \setminus A$ from the cache.

•  On a page fault, rent the requested page if any of the following hold:

1.  the page is in the set $M^{i-1} \setminus A$,

2.  the  page  has  not  yet  become  marked  (it  has  received  fewer  than  r  requests  in  this  phase), 

3.  the  cache  is  full. 


•  Otherwise,  on  a  page  fault,  load  the  requested  page  into  the  cache. 
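
The behavior of a single expert within one phase can be sketched as follows. The function name `expert_faults` and its argument names are hypothetical; the sketch also assumes that a page counts as marked on its r-th request of the phase, so that request is itself a fault that triggers the load.

```python
def expert_faults(requests, prev_marked, keep, k, r):
    """Count page faults of the expert for subset `keep` of `prev_marked` in one phase.

    `requests` is the request sequence of the phase, `prev_marked` plays the role of
    the pages marked in the previous phase, and `keep` is the expert's subset A.
    """
    cache = set(keep)            # pages of prev_marked outside `keep` are ejected up front
    counts = {}
    faults = 0
    for page in requests:
        counts[page] = counts.get(page, 0) + 1
        if page in cache:
            continue
        faults += 1
        ejected = page in prev_marked and page not in keep   # rule 1
        unmarked = counts[page] < r                          # rule 2
        full = len(cache) >= k                               # rule 3
        if not (ejected or unmarked or full):
            cache.add(page)      # otherwise the page is rented and the cache is unchanged
    return faults
```

With k = 2, r = 2 and previously marked pages {a, b}, the expert keeping only {a} pays 4 faults on the sequence a c c c b b, while the expert keeping {a, b} pays 3, because its cache is always full and it must rent every request to c.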

To determine the probabilities $p^t_j$, we use the Hedge algorithm to update experts' weights, and we
compute a weighted average of the experts' caches. Specifically, $p^t_j$ is the result of dividing the total weight
on experts having page j in their cache by the total weight on all the experts. We update the weights on the
experts as in Hedge by penalizing them by a factor β = 1/2 whenever they incur a page fault. If we select
a cache according to a distribution matching these probabilities, then our algorithm's expected number of
page faults will match the expected cost to Hedge.

One final addendum to the algorithm: If $m^i = 0$ (i.e., the pages marked in this phase match the pages
marked from the previous phase), then the off-line bound is $o^i$ in this phase but some of the experts pay
more than $r o^i$ because they foolishly eject pages from their cache at the start for no reason. Therefore
our algorithm also expects to pay more than $r o^i$ and thus is not competitive. To handle this problem, our
algorithm simulates the experts in a somewhat lazy manner. In particular, if an expert it is following says
to eject a page but does not indicate a page to fill that slot, then the algorithm notes the recommendation
but does not evict it until required. Nonetheless, we define the probabilities $p^t_j$ as if we were immediately
following the advice of the experts. The only case in which this turns out to be important is the case $m^i = 0$.

Lemma 7.4  By combining these experts using Hedge, the on-line algorithm's expected number of
page faults in phase i is at most $(m^i + o^i)(2.8r + 2 \ln k + 1.1)$.

Proof.  The case $m^i = 0$ (when $M^i = M^{i-1}$) is a special case so we handle it first. In this case we
use the fact that our algorithm is lazily following the experts' advice and that for $m^i = 0$, no expert
will recommend loading any pages into the cache. Therefore, the algorithm will have $M^{i-1} = M^i$
in its cache throughout the phase, paying a total of $r o^i$, meeting the desired bound. In the following,
then, we assume $m^i > 0$.

One of the experts will do quite well, in particular the expert with $A = M^{i-1} \cap M^i$. This expert
"knows" which of the marked pages from the previous phase should remain for the current phase,
and it will not eject these. Note that this expert's initial weight is $1/k^{m^i}$.

This good expert makes at most $2r m^i + r o^i$ page faults in the phase: For each of the $k - m^i$ pages
$j \in M^i \cap M^{i-1}$, it incurs 0 page faults because $j \in A$. For each of the $m^i$ pages $j \in M^i \setminus M^{i-1}$,
it incurs a total of r page faults until the page is finally marked and brought into the cache. For each
of the $m^i$ pages $j \in M^{i-1} \setminus M^i$, the renting cost is $t^i_j$, which we know is less than r since j is not
marked in phase i. Finally, the expert always rents pages $j \notin M^{i-1} \cup M^i$, and the total renting cost
for these is $r o^i$.

Theorem 3.2 for the loss of the Hedge algorithm can be generalized to the case of experts with
unequal initial weights. In this case, the bound becomes

$$\frac{\ln(1/\beta)}{1-\beta}\, L + \frac{1}{1-\beta} \ln \frac{W}{w}$$

where w is the initial weight of the best expert in hindsight (and, as before, L is the loss of that
expert) and W is the sum of the initial weights. In our case, if we choose β = 1/2 and maintain
probabilities $p^t_j$ according to the expert weights as above, then the total expected number of page
faults is at most

$$1.4\,(2r m^i + r o^i) + 2 \ln \frac{W}{w_A} , \qquad (7.2)$$

where W is the total of the experts' initial weights and $w_A = 1/k^{m^i}$ is the weight for expert A.
Since for each m between 1 and k, there are $\binom{k}{m}$ experts of weight $k^{-m}$, the total weight W is at
most $\sum_{m=1}^{k} \binom{k}{m} k^{-m} < e - 1$. Thus (7.2) is at most

$$2.8r\,(m^i + o^i) + 2\,(m^i \ln k + \ln(e - 1)) < (m^i + o^i)(2.8r + 2 \ln k + 1.1) .$$
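
The bound on the total initial weight follows from the binomial theorem, since the sum over m from 0 to k of the same terms equals (1 + 1/k)^k, which is less than e. The sketch below (the function name `total_initial_weight` is hypothetical) evaluates the sum directly so the inequality and the constant 1.1 can be checked for small k.

```python
import math

def total_initial_weight(k):
    """Sum over m = 1..k of C(k, m) * k**(-m), the weight bound used in the proof."""
    return sum(math.comb(k, m) * k ** (-m) for m in range(1, k + 1))
```

The constant 1.1 in Lemma 7.4 comes from 2 ln(e − 1), which is slightly less than 1.1.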


One  additional  nonobvious  fact  about  our  use  of  the  Hedge  algorithm  is  the  following. 

Lemma 7.5  If there is a request to page j at time t, then $p^{t+1}_j \ge p^t_j$ and for all $i \ne j$, $p^{t+1}_i \le p^t_i$.

Proof  sketch.  The  easy  part  of  the  lemma  is  the  statement  that  when  a  request  is  made  to  page 
j,  the  probability  that  page  j  is  in  the  cache  increases.  That  happens  because  Hedge  penalizes  all 
experts  that  do  not  have  j  in  their  cache  and  does  not  penalize  those  that  do.  The  harder  part  is  the 
statement about pages $i \ne j$; in particular, perhaps some pages are correlated.

Consider any fixed $m \le k$. Let $W_1$ be the weight on experts for m-sets containing j, $W_\beta$ be the
weight on experts for m-sets not containing j, $W_{1,i}$ be the weight on experts for m-sets containing
i and j, and $W_{\beta,i}$ be the weight on experts for m-sets containing i but not j. We want to show that

$$\frac{W_{1,i} + W_{\beta,i}}{W_1 + W_\beta} \ge \frac{W_{1,i} + \beta W_{\beta,i}}{W_1 + \beta W_\beta} .$$

This follows if we can show $W_{1,i} W_\beta \le W_{\beta,i} W_1$. Let M be the set of pages marked in the previous
phase. Consider the instant before the request, and let o be the number of requests to pages outside
M and $\ell_{i'}$ be the number of requests to each page $i' \in M$. Observe that the expert for a set A has
accumulated loss $o + \sum_{i' \in M \setminus A} \ell_{i'}$ and so its weight $w_A$ is $\beta^{\,o + \sum_{i' \in M \setminus A} \ell_{i'}}$. The proof uses this fact
to show that each term on the left-hand side $W_{1,i} W_\beta$ corresponds to a term on the right-hand side
$W_{\beta,i} W_1$.  ■

Moving  between  probabilities 

At any point in time, our algorithm maintains a probability distribution q over caches (experts), which
induces page probabilities $p_j$ over pages. The section above describes one distribution q using the Hedge
algorithm. Notice, however, that for the purpose of computing the expected number of page faults (as
in Lemma 7.4), any two distributions over caches that induce the same page probabilities are equivalent.
Therefore, we are free to deviate from the instructions given by the Hedge algorithm so long as we are
faithful to the page probabilities $p_j$. This is important for the next part of our analysis, where we bound the
expected cost incurred by moving between probability distributions.

In particular, we now examine the following question. Given a current distribution q over caches that
induces probabilities $p_j$ over pages, and given a new target set of page probabilities p' that satisfies
$\sum_j p'_j \le k$, we want to move to some new distribution q' over caches that induces p'. At a minimum, any
algorithm must load an expected $\sum_{j: p'_j > p_j} (p'_j - p_j)$ number of pages to move from the page probabilities
p to p'. Achieving this is easily possible in a setting where $\sum_j p_j = 1$ (e.g., the case of k + 1 pages total in
which $p_j$ represents the probability that page j is not in the cache) but it is harder in our setting, where
$\sum_j p_j$ is as large as k. In this section, we show a method for achieving an expected cost of at most
$2 \sum_{j: p'_j > p_j} (p'_j - p_j)$.

A simple example will help illustrate the difficulty and the algorithm. Say that k = 2 and initially our
cache is [A, B] with probability 1/2 and [C, D] with probability 1/2. This induces page probabilities p; say we
want to convert this to a new distribution p' as follows.

page    A      B      C      D
p       1/2    1/2    1/2    1/2
p'      3/4    1/4    1/2    1/2

If we momentarily forget about the cache capacity of k, we can easily move to a new cache distribution
$\tilde{q}$ consistent with p': we can simply evict B with probability 1/2 if our cache is [A, B] and load A with
probability 1/2 if our cache is [C, D]. So $\tilde{q}$ is the following.

cache         [A]    [A,B]    [C,D]    [A,C,D]
$\tilde{q}$   1/4    1/4      1/4      1/4

The [A, C, D] possibility, unfortunately, exceeds the size limit of k = 2. However, there is (and there must
be) a cache that has a vacancy, in this case [A]. We rebalance by adding page D to the small cache and
evicting D from the large cache. This new cache distribution now includes only legal caches, and we use
this for q'.

cache    [A,D]    [A,B]    [C,D]    [A,C]
q'       1/4      1/4      1/4      1/4

In other words, the strategy in this case is: "if our cache is [A, B] then with probability 1/2 do nothing
and with probability 1/2 evict B and load D; if our cache is [C, D] then with probability 1/2 do nothing and
with probability 1/2 evict D and load A." This strategy seems a bit strange because p'(D) = p(D) yet we
sometimes evict or load D, but this is necessary in this situation. As you can see, the expected number of
page loads in this example is 1/2, which equals $2 \sum_{j: p'_j > p_j} (p'_j - p_j)$.

Our strategy, in general, is as follows. To move from a set of probabilities p to p', for any page j with
$p'_j < p_j$, we evict j from our cache (if present) with probability $1 - p'_j/p_j$. Next, for pages with $p'_j > p_j$, we
add them to a cache not containing j with probability $(p'_j - p_j)/(1 - p_j)$. This gives us a cache distribution
$\tilde{q}$ with the correct probabilities p' and loading cost $\sum_{j: p'_j > p_j} (p'_j - p_j)$, but it may create caches that are too
large.

Fortunately, the expected number of pages in the cache is $\sum_j p'_j \le k$. Thus, if there are caches with
more than k pages, there must be caches with fewer than k pages. Take a cache with more than k pages
and one with fewer than k pages, and some page that is in the larger but not the smaller. We can evict the
page from the larger cache and load it to the smaller cache in such a way as to not change p'. If the two
caches do not have equal probabilities, we cannot immediately reduce the probability of both of the original
caches to 0. However, one of the two caches will end with probability 0, and thus we are always making
discrete progress in decreasing the total excess and shortage in cache sizes, over all caches with nonzero
probability. Furthermore, the total probability of performing a load in the rebalancing step is no more than
the probability of loading a page in the increase step, since each load required for a rebalance originates
from an increased probability. The expected number of loads is no more than $2 \sum_{j: p'_j > p_j} (p'_j - p_j)$.
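
The evict, increase, and rebalance steps described above can be sketched on an explicit cache distribution. This is an illustration under stated assumptions, not the thesis's implementation: the names `marginals`, `_shift`, and `move` are hypothetical, distributions are dictionaries from frozensets to exact `Fraction` masses, every initial cache is assumed to hold at most k pages, the target probabilities are assumed to sum to at most k, and pages absent from `target` keep their current probability. The choice of which page to move during rebalancing is arbitrary here.

```python
from fractions import Fraction

def marginals(dist):
    """Page probabilities induced by a distribution {frozenset(cache): mass}."""
    p = {}
    for cache, mass in dist.items():
        for page in cache:
            p[page] = p.get(page, 0) + mass
    return p

def _shift(dist, src, dst, mass):
    """Move `mass` of probability from cache `src` to cache `dst`."""
    dist[src] -= mass
    if dist[src] == 0:
        del dist[src]
    dist[dst] = dist.get(dst, 0) + mass

def move(dist, target, k):
    """Return (new_dist, expected_loads), where new_dist induces `target`."""
    dist = dict(dist)
    p = marginals(dist)
    loads = Fraction(0)
    for page in target:                       # evict step (evictions are free)
        old, new = p.get(page, Fraction(0)), target[page]
        if new < old:
            for cache, mass in list(dist.items()):
                if page in cache:
                    _shift(dist, cache, cache - {page}, mass * (1 - new / old))
    for page in target:                       # increase step (each add is one load)
        old, new = p.get(page, Fraction(0)), target[page]
        if new > old:
            for cache, mass in list(dist.items()):
                if page not in cache:
                    part = mass * (new - old) / (1 - old)
                    _shift(dist, cache, cache | {page}, part)
                    loads += part
    while True:                               # rebalance step
        oversized = [c for c in dist if len(c) > k]
        if not oversized:
            return dist, loads
        large = oversized[0]
        # A cache with a vacancy exists whenever the target sums to at most k.
        small = next(c for c in dist if len(c) < k)
        page = min(large - small)             # arbitrary page in large but not small
        delta = min(dist[large], dist[small])
        _shift(dist, large, large - {page}, delta)
        _shift(dist, small, small | {page}, delta)
        loads += delta
```

On the worked example with k = 2, starting from [A, B] and [C, D] with probability 1/2 each and moving to probabilities 3/4, 1/4, 1/2, 1/2 for A, B, C, D, this sketch ends with only legal caches and an expected 1/2 loads, matching twice the total probability increase.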

Lemma 7.6  Given a probability distribution q on caches, this implies page probabilities p. Given
a new set of page probabilities p', we can move to a new probability distribution q' on caches with
expected cost at most $2 \sum_{j: p'_j > p_j} (p'_j - p_j)$.


Bounding  the  on-line  movement  cost 

The final step to showing that our algorithm achieves the required bound is to bound what the algorithm
pays to load pages in maintaining the page probabilities $p^t_j$. We do this by employing Lemma 7.6 to bound
this cost in terms of the expected number of page faults analyzed in Section 7.4.

Lemma 7.7  Using the movement strategy given in Lemma 7.6, the expected loading cost for the
probability sequence used in Lemma 7.4 is at most $(m^i + o^i)(2.8r + 2 \ln k + 1.1)$.

Proof.  Consider the expert weights before receiving a request to page j. Let p be the page
probabilities before the request and p' be the page probabilities after the request. Since j is the only
page whose probability of being in the cache increases (Lemma 7.5), the expected loading cost from
Lemma 7.6 is at most $2(p'_j - p_j)$.

We want to bound $p'_j - p_j$. Let x be the total weight on experts who have probability 1 on j and
let y be the total weight on experts who have probability 0 on j. Since each expert in the first set has
a loss of 0, the request will not alter their weights. Experts in the second set, however, experience a
loss of 1, so their total weight decreases to βy = y/2.

$$p'_j - p_j = \frac{x}{x + y/2} - \frac{x}{x + y} = \frac{xy/2}{(x + y)(x + y/2)} \le \frac{1}{2} \cdot \frac{y}{x + y} = \frac{1}{2}(1 - p_j)$$

This $1 - p_j$ is exactly the probability of faulting on the request. Thus our expected loading cost (at
most $2(p'_j - p_j)$) is at most the expected number of page faults. The lemma follows from the bound
of Lemma 7.4.  ■
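
The inequality in the proof can be spot-checked with exact arithmetic. The helper below is a small sketch (the name `probability_increase` is hypothetical) that computes the increase in the cache probability of page j after a request, given the weights x and y defined in the proof.

```python
from fractions import Fraction

def probability_increase(x, y, beta=Fraction(1, 2)):
    """Increase in p_j when weight x already caches j and weight y is penalized by beta."""
    x, y = Fraction(x), Fraction(y)
    return x / (x + beta * y) - x / (x + y)
```

For x = y = 1 the increase is 1/6, so twice the increase is 1/3, which is below the fault probability y/(x + y) = 1/2.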


Bounding the total expected on-line cost using Lemmas 7.4 (renting cost) and 7.7 (loading cost), and
bounding the off-line cost using Lemma 7.3, we conclude with our competitive ratio of O(r + log k).

Theorem 7.8  There is an algorithm whose r-unfair competitive ratio for Paging is
8(2.8r + 2 ln k + 1.1).

Chapter 8
Conclusion


The  metrical  task  system  problem  is  one  of  the  fundamental  on-line  problems  in  computer  science.  In 
this  thesis,  we  have  seen  how  its  applications  include  machine  learning  and  process  migration.  The  thesis 
has  neglected  to  mention  its  theoretical  applications  to  other  on-line  problems  like  robot  navigation  and  file 
migration. 

We  have  seen  how  one  can  achieve  much-improved  asymptotic  guarantees  for  metrical  task  systems. 
While  the  general-metric  result  is  not  immediately  useful  for  actual  systems,  along  the  way  we  learned 
about  algorithms  for  the  uniform  metric  that  do  have  practical  promise,  like  Share  and  Odd-Exponent. 
The process migration experiment (Section 6.3) bolsters the feeling that these can be useful alternatives to
Marking.

8.1  Themes 

On our way to achieving improved results, we have seen three themes develop that may apply to more
on-line analysis. The first is the useful relationship between a fundamental machine learning theory problem,
Experts, and competitive analysis, especially with the unfair MTS problem. The Experts results have much
promise as important tools for solving on-line problems; we have seen how they touch on MTS,
Combine-Online, and Paging, but they are likely to have uses elsewhere. The Experts problem deserves to be
included with MTS and k-Server as foundations for on-line analysis of algorithms.

Another  theme  of  this  thesis  is  the  use  of  unfairness  to  refine  our  on-line  goals.  Essentially,  unfairness 
gives  us  the  opportunity  to  prioritize  different  types  of  costs  by  introducing  a  trade-off  parameter.  We  have 
seen  applications  to  MTS,  Experts,  and  Paging;  in  all  cases,  the  tradeoff  has  been  between  moving  between 
selections  and  sticking  with  the  current  selection.  Whether  the  unfairness  concept  can  be  applied  naturally 
to  other  problems  remains  to  be  seen. 

Finally, we have seen the importance of metric space approximation in competitive analysis. The
polylog(n) metrical task system result is a significant, sophisticated illustration of the usefulness of HST
approximation to competitive analysis and approximation algorithms. Besides being historically one of the
first major results using Bartal's HST approximation, metrical task systems are also likely to endure as an
instance where HST approximation allows us to do much better than we can without it.

8.2  Open questions


A number of questions touched on in the course of this thesis remain open.

Question 8.1  Can Bartal's O(h log n log log n) approximation factor of an arbitrary metric space
by h-HSTs be improved to O(h log n) [Bar98]?

Question 8.2  Is there a metric space where one can achieve an o(log n) competitive ratio for MTS?
Blum et al. prove that for any algorithm on any particular space, the competitive ratio is at least
$\Omega(\sqrt{\log n / \log\log n})$ [BKRS92].

Question 8.3  Can we improve on the competitive ratio for the MTS problem on general metric
spaces? This thesis proves $O(\log^5 n \log\log n)$ (Theorem 4.8); building on this result, Fiat and
Mendel improve it to $O(\log^2 n \log^2 \log n)$ [FM00]. Both use the only known tractable approach to
achieving sublinear bounds: building an algorithm for an HST. This approach has the shortcoming
that the metric space approximation factor will not improve beyond O(log n), and the competitive
ratio for the HST will not improve beyond O(log n), giving an inherent limit of $O(\log^2 n)$.

Question 8.4  We have seen a number of algorithms for the r-unfair MTS problem on a uniform
metric, the best bound being r + 2e ln n achieved by Odd-Exponent. Can one get an r + ln n
algorithm for this problem? And is there an intuitive explanation for why Odd-Exponent, with its
peculiar structure, does so well?

Question 8.5  Example 5.2 shows that one can get arbitrarily close to a static adversary's
performance for both List-Update and Dynamic-Tree, but the algorithms to do this are massively
inefficient. Are there efficient algorithms to do the same?

Question 8.6  For the Bandits problem with a switching cost, Corollary 5.4 shows an algorithm
that is an additive $O(\sqrt[3]{d n T^2 \ln n})$ from the gain of the best bandit. Can this be improved to
$O(\sqrt{d n T \ln n})$, as Auer et al. achieve for the problem with no switching cost [ACBFS98]?

Question 8.7  The paging algorithm of Chapter 7 requires exponential running time. Is there an
efficient method achieving the same O(r + log k) guarantee?

Question 8.8  Can one achieve a guarantee of r + O(log k) for r-unfair Paging? Or perhaps
$(1 + \epsilon) r + O(\frac{1}{\epsilon} \log k)$? And can such an algorithm for the unfair scenario be used for k-Server on
an HST space? We were able to abstract lower levels for MTS, but determining the proper way to do
this for k-Server is a challenging problem. For instance, it appears that such an abstraction would
have to encourage multiple servers to be at a single point in the uniform space.

Question 8.9  For that matter, is there any way of using randomization to improve the 2k − 1 ratio
for k-Server achieved by Koutsoupias and Papadimitriou [KP95]? The conjecture is that O(log k)
is possible, but we appear very far from any sublinear guarantee.